Abstract

In the practice of point prediction, it is desirable that forecasters receive a
directive in the form of a statistical functional, such as the mean or a quantile
of the predictive distribution. When evaluating and comparing competing
forecasts, it is then critical that the scoring function used for these purposes
be consistent for the functional at hand, in the sense that the expected score
is minimized when following the directive.

We show that any scoring function that is consistent for a quantile or an
expectile functional, respectively, can be represented as a mixture of extremal
scoring functions that form a linearly parameterized family. Scoring functions
for the mean value and probability forecasts of binary events constitute
important examples. The quantile and expectile functionals along with the
respective extremal scoring functions admit appealing economic interpretations
in terms of thresholds in decision making.

The Choquet type mixture representations give rise to simple checks of 
whether a forecast dominates another in the sense that it is preferable under 
any consistent scoring function. In empirical settings it suffices to compare 
the average scores for only a finite number of extremal elements. Plots of 
the average scores with respect to the extremal scoring functions, which we 
call Murphy diagrams, permit detailed comparisons of the relative merits of 
competing forecasts. 

Key words and phrases: Choquet representation; consistent scoring function;
decision theory; economic utility; elicitable; expectile; forecast ranking;
order sensitivity; point forecast; probability forecast; quantile

1 Introduction


Over the past two decades, a broad transdisciplinary consensus has developed that 
forecasts ought to be probabilistic in nature, i.e., they ought to take the form of 
predictive probability distributions over future quantities or events (Gneiting and 
Katzfuss 2014). Nevertheless, a wealth of applied settings require point forecasts, 
be it for reasons of decision making, tradition, reporting requirements, or ease of 
communication. In this situation, a directive is required as to the specific feature
or functional of the predictive distribution that is being sought.

We follow Gneiting (2011) and consider a functional to be a potentially
set-valued mapping $T(F)$ from a class $\mathcal{F}$ of probability distributions
to the real line, $\mathbb{R}$, with the mean or expectation functional, quantiles,
and expectiles being key examples. Competing point forecasts are then compared by
using a nonnegative scoring function, $S(x, y)$, that represents the loss or penalty
when the point forecast $x$ is issued and the observation $y$ realizes. A critically
important requirement on the scoring function is that it be consistent for the
functional $T$ relative to the class $\mathcal{F}$, in the sense that

$$ \mathbb{E}_F[S(t, Y)] \le \mathbb{E}_F[S(x, Y)] \qquad (1) $$

for all probability distributions $F \in \mathcal{F}$, all $t \in T(F)$, and all
$x \in \mathbb{R}$. If equality in (1) implies that $x \in T(F)$, then the scoring
function is strictly consistent.

To give a prominent example, the ubiquitous squared error scoring function,
$S(x, y) = (x - y)^2$, is strictly consistent for the mean or expectation functional
relative to the class of probability distributions with finite variance. However, there
are many alternatives. In a classical paper, Savage (1971) showed, subject to weak
regularity conditions, that a scoring function is consistent for the mean functional
if and only if it is of the form

$$ S(x, y) = \phi(y) - \phi(x) - \phi'(x)(y - x), \qquad (2) $$

where the function $\phi$ is convex with subgradient $\phi'$; squared error arises when
$\phi(t) = t^2$. Holzmann and Eulert (2014) proved that when forecasts make ideal
use of nested information bases, the forecast with the broader information basis is
preferable under any consistent scoring function.
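
As a numerical illustration of the Savage form (2), the following minimal sketch evaluates two scores of that form on a simulated sample and checks that, among a set of candidate point forecasts, both average scores are smallest at the sample mean. The choice $\phi(t) = \exp(t)$, the simulated data, and all function names are assumptions made for illustration only.

```python
import math
import random

def bregman_score(x, y, phi, dphi):
    """Score of the form (2): phi(y) - phi(x) - phi'(x) * (y - x)."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

def squared_error(x, y):
    # phi(t) = t^2 recovers (x - y)^2.
    return bregman_score(x, y, lambda t: t * t, lambda t: 2.0 * t)

def exponential_score(x, y):
    # phi(t) = exp(t) is another convex choice (assumed here for illustration).
    return bregman_score(x, y, math.exp, math.exp)

random.seed(1)
sample = [random.gauss(0.3, 1.0) for _ in range(2000)]
sample_mean = sum(sample) / len(sample)

def average_score(score, x):
    """Average score of the point forecast x over the simulated outcomes."""
    return sum(score(x, y) for y in sample) / len(sample)

# Candidate point forecasts: the sample mean and shifted versions of it.
candidates = [sample_mean + 0.05 * k for k in range(-20, 21)]
best_squared = min(candidates, key=lambda x: average_score(squared_error, x))
best_exponential = min(candidates, key=lambda x: average_score(exponential_score, x))
```

Both searches return the sample mean, which is the mean of the empirical distribution of the simulated outcomes.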

However, in real world settings, as pointed out by Patton (2015), forecasts are 
hardly ever ideal, and the ranking of competing forecasts might depend on the 
choice of the scoring function. This had already been observed by Murphy (1977), 
Schervish (1989), and Merkle and Steyvers (2013), among others, in the important 
special case of a binary predictand, where $y = 1$ corresponds to a success and
$y = 0$ to a non-success, so that the mean of the predictive distribution provides a
probability forecast for a success. As there is no obvious reason for a consistent 
scoring function to be preferred over any other, this raises the question which one 
of the many alternatives to use. 

Our work is motivated by the quest for guidance in this setting. Theoretically,
the respective key result is that, subject to unimportant regularity conditions, any
function of the form (2) admits a mixture representation of the form

$$ S(x, y) = \int_{-\infty}^{+\infty} S_\theta(x, y) \, dH(\theta), $$

where $H$ is a nonnegative measure, and

$$ S_\theta(x, y) = (y - \theta)_+ - (x - \theta)_+ - \mathbb{1}(x > \theta)(y - x)
   = \begin{cases} |y - \theta|, & \min(x, y) \le \theta < \max(x, y), \\ 0, & \text{otherwise} \end{cases} $$

for $\theta \in \mathbb{R}$. (Here and in what follows, we write $(t)_+ = \max(t, 0)$ for the
positive part of $t \in \mathbb{R}$ and $\mathbb{1}(A)$ for the indicator function of the
event $A$.) Thus every scoring function consistent for the mean can be written as a
weighted average over elementary or extremal scores $S_\theta$. As an important
consequence, a point forecast that is preferable in terms of each extremal score
$S_\theta$ is preferable in terms of any consistent scoring function. The elementary
scores can be seen as representing the loss, relative to an oracle, in an investment
problem with cost basis $\theta$ and future revenue $y$; see Section 2.3.
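
The mixture representation can be checked numerically. The sketch below implements the elementary score $S_\theta$ in both of its forms and approximates the integral for squared error, for which $\phi(t) = t^2$ and hence $dH(\theta) = d\phi'(\theta) = 2 \, d\theta$. The midpoint rule, the integration bounds, and the function names are assumptions of this illustration.

```python
def elementary_score(theta, x, y):
    """Extremal score S_theta(x, y) for the mean functional (case form)."""
    if min(x, y) <= theta < max(x, y):
        return abs(y - theta)
    return 0.0

def elementary_score_alt(theta, x, y):
    """Equivalent form using positive parts and an indicator."""
    def positive_part(t):
        return max(t, 0.0)
    indicator = 1.0 if x > theta else 0.0
    return positive_part(y - theta) - positive_part(x - theta) - indicator * (y - x)

def mixture_score(x, y, density, lo=-10.0, hi=10.0, n=40000):
    """Midpoint-rule approximation of the integral of S_theta(x, y) h(theta) dtheta."""
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * step
        total += elementary_score(theta, x, y) * density(theta)
    return total * step

x, y = 1.2, -0.7
approx = mixture_score(x, y, density=lambda theta: 2.0)  # twice the Lebesgue measure
exact = (x - y) ** 2                                      # squared error
```

Up to discretization error, `approx` agrees with `exact`.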

In empirical settings, point forecasts are compared based on their average scores.
Specifically, let us consider a sequence of triplets $(x_{i1}, x_{i2}, y_i)$ for
$i = 1, \ldots, n$, where $x_{i1}$ and $x_{i2}$ are competing point forecasts and $y_i$
is the subsequent outcome. We may compare the two forecasts graphically, by plotting
the respective empirical scores,

$$ s_j(\theta) = \frac{1}{n} \sum_{i=1}^{n} S_\theta(x_{ij}, y_i) \qquad (3) $$

for $j = 1$ and $2$, versus $\theta$. An example of this type of display, which we term a
Murphy diagram, is shown in Figure 1, where we consider point forecasts of wind
speed at a major wind energy center.
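
The empirical scores in (3) are straightforward to compute. The following sketch does so on a grid of thresholds for two forecasts of a simulated outcome; it does not use the wind speed data of Figure 1. The data-generating process, the grid, and the names `murphy_curve`, `forecast_1`, and `forecast_2` are hypothetical.

```python
import random

def elementary_score(theta, x, y):
    """Extremal score S_theta(x, y) for the mean functional."""
    return abs(y - theta) if min(x, y) <= theta < max(x, y) else 0.0

def murphy_curve(forecasts, outcomes, thetas):
    """Empirical scores s_j(theta) as in (3), one value per threshold."""
    n = len(outcomes)
    return [
        sum(elementary_score(theta, x, y) for x, y in zip(forecasts, outcomes)) / n
        for theta in thetas
    ]

random.seed(7)
signal = [random.gauss(0.0, 1.0) for _ in range(500)]
outcomes = [m + random.gauss(0.0, 1.0) for m in signal]

forecast_1 = signal                 # conditional mean given the signal
forecast_2 = [0.0] * len(signal)    # unconditional mean, ignoring the signal

thetas = [-4.0 + 0.1 * k for k in range(81)]
curve_1 = murphy_curve(forecast_1, outcomes, thetas)
curve_2 = murphy_curve(forecast_2, outcomes, thetas)
```

Plotting `curve_1` and `curve_2` against `thetas` with any plotting tool gives a display of the Murphy diagram type.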

More generally, for both quantiles and expectiles the apparent wealth of
consistent scoring functions can be reduced to a one-dimensional family of readily
interpretable elementary scores, in the sense that every consistent scoring function
can be represented as a mixture from that family. The case of the mean or
expectation functional, which includes probability forecasts for binary events as a
further special case, corresponds to the expectile at level $\alpha = 1/2$.

The remainder of the paper is organized as follows. Section 2 is devoted to
the key theoretical development, in which we state and discuss the mixture
representations, relate to Choquet theory and order sensitivity, and provide economic
interpretations of the elementary scores and the associated functionals. In
particular, we show that expectiles are optimal decision thresholds in binary investment
problems with fixed cost basis and differential taxation of profits versus losses. In
Section 3, we apply the mixture representations to study forecast rankings and
propose the aforementioned Murphy diagram for forecast comparisons. Illustrations
on data examples follow in Section 4, where we revisit meteorological and economic
case studies in the work of Gneiting et al. (2006), Rudebusch and Williams (2009),
and Patton (2015). The paper closes with a discussion in Section 5. Proofs and
computational details are deferred to appendices.

Figure 1: Murphy diagrams for the comparison of point forecasts of wind speed at the
Stateline wind energy center, using a regime-switching space-time (RST) or autoregressive
(AR) technique (Gneiting et al. 2006). The functional considered is the mean of the
respective predictive distribution. Left: Empirical scores $s_j(\theta)$ in (3) versus
$\theta$. Right: Score differences along with pointwise 95% confidence bands. A negative
difference means that the RST forecast is preferable. For details, see Sections 3.4 and 4.3.


2 Consistent scoring functions for quantiles and expectiles

Before focusing on the specific cases of quantiles and expectiles, we review
general background material on the assessment of point forecasts, with emphasis on
consistent scoring functions.

2.1 Consistent scoring functions 

We first introduce notation and expose conventions. Let $\mathcal{F}_0$ denote the class
of the probability measures on the Borel-Lebesgue sets of the real line, $\mathbb{R}$.
For simplicity, we do not distinguish between a measure $F \in \mathcal{F}_0$ and the
associated cumulative distribution function (CDF). We follow standard conventions and
assume CDFs to be right-continuous. A function $S$ defined on a rectangle
$D = D_1 \times D_2 \subseteq \mathbb{R}^2$ is called a scoring function if
$S(x, y) \ge 0$ for all $(x, y) \in D$ with $S(x, y) = 0$ if $x = y$.
Here, $S(x, y)$ is interpreted as the loss or cost accrued when the point forecast $x$ is
issued and the observation $y$ realizes. The scoring function is regular if it is jointly
measurable and left-continuous in its first argument, $x$, for every $y$.

In point prediction problems, it is rarely evident which functional of the
predictive distribution should be reported. Guidance can be given implicitly, by
specifying a loss function, or explicitly, by specifying a functional. The notion of
consistency originates in this setting.

Consider a functional $F \mapsto T(F) \subseteq \mathbb{R}$ on a class
$\mathcal{F} \subseteq \mathcal{F}_0$ on which the mapping is well-defined. Usually, the
functional is single-valued, as in the case of the mean functional where we take
$\mathcal{F}$ as the class $\mathcal{F}_1$ of the probability measures with finite
first moment. More generally, the expectile at level $\alpha \in (0, 1)$ of a probability
measure $F \in \mathcal{F}_1$ is the unique solution $t$ to the equation

$$ (1 - \alpha) \int_{-\infty}^{t} (t - y) \, dF(y) = \alpha \int_{t}^{\infty} (y - t) \, dF(y), $$

where $\alpha = 1/2$ corresponds to the mean functional (Newey and Powell 1987). In
the case of quantiles, the functional might be set-valued. Specifically, the quantile
functional at level $\alpha \in (0, 1)$ maps a probability measure $F$ to the closed interval
$[q^-_{\alpha,F}, q^+_{\alpha,F}]$ with lower limit $q^-_{\alpha,F} = \sup \{s : F(s) < \alpha\}$
and upper limit $q^+_{\alpha,F} = \sup \{s : F(s) \le \alpha\}$. The two limits differ only
when the level set $F^{-1}(\alpha)$ contains more than one point, so typically the
functional is single-valued. Any number between $q^-_{\alpha,F}$ and $q^+_{\alpha,F}$
represents an $\alpha$-quantile and will be denoted $q_{\alpha,F}$.
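
For an empirical distribution, both functionals can be computed directly from these definitions. The sketch below returns the quantile interval from order statistics and finds the expectile by bisection on the defining equation; the function names and the tolerance are assumptions of this illustration.

```python
import math

def quantile_interval(sample, alpha):
    """Lower and upper alpha-quantile of the empirical CDF of the sample."""
    ordered = sorted(sample)
    n = len(ordered)
    lower = ordered[math.ceil(n * alpha) - 1]   # sup{s : F(s) < alpha}
    upper = ordered[math.floor(n * alpha)]      # sup{s : F(s) <= alpha}
    return lower, upper

def expectile(sample, alpha, tol=1e-10):
    """Solve (1 - alpha) * E[(t - Y)_+] = alpha * E[(Y - t)_+] by bisection."""
    def imbalance(t):
        below = sum(t - y for y in sample if y < t)
        above = sum(y - t for y in sample if y > t)
        return (1.0 - alpha) * below - alpha * above   # non-decreasing in t

    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sample = [1.0, 2.0, 3.0, 10.0]
median_interval = quantile_interval(sample, 0.5)   # set-valued: two distinct limits
mean_as_expectile = expectile(sample, 0.5)         # level 1/2 gives the mean
```

With four distinct observations, the median is set-valued, whereas the expectile at level 1/2 coincides with the sample mean.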

The scoring function $S$ is consistent for a functional $T$ relative to the class
$\mathcal{F}$ if

$$ \mathbb{E}_F[S(t, Y)] \le \mathbb{E}_F[S(x, Y)] \qquad (4) $$

for all probability measures $F \in \mathcal{F}$, all $t \in T(F)$, and all point
forecasts $x \in D_1$. A functional $T$ that admits a strictly consistent scoring
function is called elicitable, and can then be represented as the solution to an
optimization problem, in that

$$ T(F) = \arg\min_x \mathbb{E}_F[S(x, Y)]. $$

Hence, if the goal is to minimize expected loss, the optimal strategy is to follow the
requested directive in the form of a functional.

In what follows, we restrict attention to the quantile and expectile functionals. 
These are critically important in a gamut of applications, including quantile and 
expectile regression in general, and least squares (i.e., mean) and probit and logit 
(i.e., binary probability) regression in particular. 

2.2 Mixture representations 

The classes of the consistent scoring functions for quantiles and expectiles have been 
described by Savage (1971), Thomson (1979), and Gneiting (2011), and we review 
the respective characterizations in the setting of the latter paper, where further
detail is available. 

Up to mild regularity conditions, a scoring function $S$ is consistent for the
quantile functional at level $\alpha \in (0, 1)$ relative to the class $\mathcal{F}_0$ if
and only if it is of the form

$$ S(x, y) = (\mathbb{1}(y < x) - \alpha)(g(x) - g(y)), \qquad (5) $$

where $g$ is non-decreasing. The most prominent example arises when $g(t) = t$,
which yields the asymmetric piecewise linear scoring function,

$$ S(x, y) = \begin{cases} (1 - \alpha)(x - y), & y < x, \\ \alpha (y - x), & y \ge x, \end{cases} \qquad (6) $$

that lies at the heart of quantile regression (Koenker and Bassett 1978; Koenker
2005). Similarly, a scoring function is consistent for the expectile at level
$\alpha \in (0, 1)$ relative to the class $\mathcal{F}_1$ if and only if it is of the form

$$ S(x, y) = |\mathbb{1}(y < x) - \alpha| \, (\phi(y) - \phi(x) - \phi'(x)(y - x)), \qquad (7) $$

where $\phi$ is convex with subgradient $\phi'$. The key example arises when
$\phi(t) = t^2$, where

$$ S(x, y) = \begin{cases} (1 - \alpha)(x - y)^2, & y < x, \\ \alpha (x - y)^2, & y \ge x. \end{cases} \qquad (8) $$


This is the loss function used for estimation in expectile regression (Newey and 
Powell 1987; Efron 1991), including the ubiquitous case a = 1/2 of ordinary least 
squares regression. 
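
To illustrate the forms (5) and (7), the following sketch implements both families with the defaults $g(t) = t$ and $\phi(t) = t^2$, and checks on a simulated sample that two different non-decreasing choices of $g$ select the same sample quantile. The alternative $g = \arctan$, the simulated data, and the function names are assumptions made for illustration.

```python
import math
import random

def quantile_score(x, y, alpha, g=lambda t: t):
    """Score of the form (5); g must be non-decreasing. g(t) = t gives (6)."""
    indicator = 1.0 if y < x else 0.0
    return (indicator - alpha) * (g(x) - g(y))

def expectile_score(x, y, alpha, phi=lambda t: t * t, dphi=lambda t: 2.0 * t):
    """Score of the form (7); phi must be convex. phi(t) = t^2 gives (8)."""
    indicator = 1.0 if y < x else 0.0
    return abs(indicator - alpha) * (phi(y) - phi(x) - dphi(x) * (y - x))

random.seed(3)
sample = [random.gauss(0.0, 1.0) for _ in range(101)]
alpha = 0.9

def best_candidate(score):
    """Sample point with the smallest total score over the sample."""
    return min(sample, key=lambda x: sum(score(x, y) for y in sample))

best_linear = best_candidate(lambda x, y: quantile_score(x, y, alpha))
best_atan = best_candidate(lambda x, y: quantile_score(x, y, alpha, g=math.atan))

# Empirical alpha-quantile; unique here because 101 * 0.9 is not an integer.
empirical_quantile = sorted(sample)[math.ceil(len(sample) * alpha) - 1]
```

Both minimizers coincide with the empirical quantile, in line with consistency relative to the empirical distribution.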

In view of the representations (5) and (7), the scoring functions that are
consistent for quantiles and expectiles are parameterized by the non-decreasing functions
$g$, and the convex functions $\phi$ with subgradient $\phi'$, respectively. In general,
neither $g$ nor $\phi$ and $\phi'$ are uniquely determined. We therefore select special
versions of these functions. Furthermore, in the interest of simplicity we generally
assume that $D_1 \times D_2 = \mathbb{R}^2$, adding comments in cases where there are
finite boundary points. Let $\mathcal{I}$ denote the class of all left-continuous
non-decreasing real functions, and let $\mathcal{C}$ denote the class of all convex real
functions $\phi$ with subgradient $\phi' \in \mathcal{I}$. This last condition is
satisfied when $\phi'$ is chosen to be the left-hand derivative of $\phi$, which
exists everywhere and is left-continuous by construction.

In what follows, we use the symbol $\mathcal{S}^Q_\alpha$ to denote the class of the
scoring functions $S$ of the form (5) where $g \in \mathcal{I}$. Similarly, we write
$\mathcal{S}^E_\alpha$ for the class of the scoring functions $S$ of the form (7) where
$\phi \in \mathcal{C}$. For all practical purposes, the families $\mathcal{S}^Q_\alpha$
and $\mathcal{S}^E_\alpha$ can be identified with the classes of the regular scoring
functions that are consistent for quantiles and expectiles, respectively. These classes
appear to be rather large. However, in either case the apparent multitude can be reduced
to a one-dimensional family of elementary scoring functions, in the sense that
every consistent scoring function admits a representation as a mixture of elementary
elements.
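Theorem 1a (quantiles). Any member of the class $\mathcal{S}^Q_\alpha$ admits a
representation of the form

$$ S(x, y) = \int_{-\infty}^{+\infty} S^Q_{\alpha,\theta}(x, y) \, dH(\theta) \qquad (x, y \in \mathbb{R}), \qquad (9) $$

where $H$ is a nonnegative measure and

$$ S^Q_{\alpha,\theta}(x, y) = (\mathbb{1}(y < x) - \alpha)(\mathbb{1}(\theta < x) - \mathbb{1}(\theta < y))
   = \begin{cases} 1 - \alpha, & y \le \theta < x, \\ \alpha, & x \le \theta < y, \\ 0, & \text{otherwise}. \end{cases} \qquad (10) $$

The mixing measure $H$ is unique and satisfies $dH(\theta) = dg(\theta)$ for
$\theta \in \mathbb{R}$, where $g$ is the nondecreasing function in the representation (5).
Furthermore, we have $H(x) - H(y) = S(x, y)/(1 - \alpha)$ for $x > y$.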

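Theorem 1b (expectiles). Any member of the class $\mathcal{S}^E_\alpha$ admits a
representation of the form

$$ S(x, y) = \int_{-\infty}^{+\infty} S^E_{\alpha,\theta}(x, y) \, dH(\theta) \qquad (x, y \in \mathbb{R}), \qquad (11) $$

where $H$ is a nonnegative measure and

$$ S^E_{\alpha,\theta}(x, y) = |\mathbb{1}(y < x) - \alpha| \, ((y - \theta)_+ - (x - \theta)_+ - (y - x) \mathbb{1}(\theta < x))
   = \begin{cases} (1 - \alpha) |y - \theta|, & y \le \theta < x, \\ \alpha |y - \theta|, & x \le \theta < y, \\ 0, & \text{otherwise}. \end{cases} \qquad (12) $$

The mixing measure $H$ is unique and satisfies $dH(\theta) = d\phi'(\theta)$ for
$\theta \in \mathbb{R}$, where $\phi'$ is the left-hand derivative of the convex function
$\phi$ in the representation (7). Furthermore, we have
$H(x) - H(y) = \partial_2 S(x, y)/(1 - \alpha)$ for $x > y$, where $\partial_2$
denotes the left-hand derivative with respect to the second argument.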


Note that the relations in (9) and (11) hold pointwise. In particular, the
respective integrals are pointwise well-defined. This is because for
$(x, y) \in \mathbb{R}^2$, the functions $\theta \mapsto S^Q_{\alpha,\theta}(x, y)$ and
$\theta \mapsto S^E_{\alpha,\theta}(x, y)$ are right-continuous, non-negative, and
uniformly bounded with bounded support, and because the non-decreasing functions
$g$ and $\phi'$ define non-negative measures $dg$ and $d\phi'$ that assign finite mass to
any finite interval.

In the case of quantiles, the asymmetric piecewise linear scoring function
corresponds to the choice $g(t) = t$ in (5), so the mixing measure $H$ in the
representation (9) is the Lebesgue measure. The elementary scoring function
$S^Q_{\alpha,\theta}$ arises when $g(t) = \mathbb{1}(\theta < t)$, i.e., when $H$ is a
one-point measure in $\theta$.
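
The representations (9) and (11) lend themselves to a numerical check. The sketch below uses $g = \arctan$ in (5), so that $dH(\theta) = d\theta / (1 + \theta^2)$, and $\phi(t) = t^2$ in (7), so that $dH(\theta) = 2 \, d\theta$. These choices, the midpoint rule, and the function names are assumptions made for illustration.

```python
import math

def extremal_quantile_score(theta, x, y, alpha):
    """S^Q_{alpha,theta}(x, y) as in (10)."""
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0

def extremal_expectile_score(theta, x, y, alpha):
    """S^E_{alpha,theta}(x, y) as in (12)."""
    if y <= theta < x:
        return (1.0 - alpha) * abs(y - theta)
    if x <= theta < y:
        return alpha * abs(y - theta)
    return 0.0

def integrate(f, lo, hi, n=20000):
    """Midpoint rule on [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (k + 0.5) * step) for k in range(n))

alpha, x, y = 0.8, 0.4, 1.9
lo, hi = min(x, y), max(x, y)   # the extremal scores vanish outside [lo, hi)
below = 1.0 if y < x else 0.0   # indicator of the event y < x

# Quantiles: mixture (9) with density 1 / (1 + theta^2) versus the form (5).
mixture_q = integrate(
    lambda t: extremal_quantile_score(t, x, y, alpha) / (1.0 + t * t), lo, hi)
direct_q = (below - alpha) * (math.atan(x) - math.atan(y))

# Expectiles: mixture (11) with density 2 versus the form (8).
mixture_e = integrate(
    lambda t: 2.0 * extremal_expectile_score(t, x, y, alpha), lo, hi)
direct_e = abs(below - alpha) * (x - y) ** 2
```

In both cases the mixture and the direct evaluation agree up to discretization error.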

In the case of expectiles, the mixing measure for the asymmetric squared error
scoring function is twice the Lebesgue measure. The choice $\alpha = 1/2$ recovers the
mean or expectation functional, for which existing parametric subfamilies emerge as
special cases of our mixture representation. Patton’s (2015) exponential Bregman
family,

$$ S_a(x, y) = \frac{1}{a^2} (\exp(ay) - \exp(ax)) - \frac{1}{a} \exp(ax)(y - x) \qquad (a \ne 0), $$

which nests the squared error loss in the limit as $a \to 0$, corresponds to the choice
$\phi(t) = a^{-2} \exp(at)$ in (7). The mixing measure $H$ in the representation (11) then
has Lebesgue density $h(\theta) = \exp(a\theta)$ for $\theta \in \mathbb{R}$. For Patton’s
(2011) family

$$ S_b(x, y) = \begin{cases}
   \dfrac{1}{b(b-1)} (y^b - x^b) - \dfrac{1}{b-1} x^{b-1} (y - x), & b \notin \{0, 1\}, \\[1ex]
   \dfrac{y}{x} - \log \dfrac{y}{x} - 1, & b = 0, \\[1ex]
   y \log \dfrac{y}{x} - (y - x), & b = 1,
   \end{cases} $$

of homogeneous scoring functions on the positive half line the mixing measure has
Lebesgue density $h(\theta) = \theta^{b-2} \, \mathbb{1}(\theta > 0)$, remarkably with no
case distinction being required. The elementary scoring function $S^E_{\alpha,\theta}$
emerges when $\phi(t) = (t - \theta)_+$ in (7); here the mixing measure in (11) is a
one-point measure in $\theta$.
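
For the exponential Bregman family, the density $\exp(a\theta)$ can be verified numerically against the elementary score for the mean, $S_\theta = 2 S^E_{1/2,\theta}$, from the introduction. The sketch below compares the closed form of $S_a$ with a midpoint-rule approximation of the integral of $S_\theta(x, y) \exp(a\theta)$ over $\theta$; the test values and function names are assumptions.

```python
import math

def elementary_score(theta, x, y):
    """S_theta(x, y) for the mean, equal to 2 * S^E_{1/2,theta}(x, y)."""
    return abs(y - theta) if min(x, y) <= theta < max(x, y) else 0.0

def exponential_bregman(x, y, a):
    """Closed form of S_a(x, y) for a != 0."""
    return (math.exp(a * y) - math.exp(a * x)) / a ** 2 - math.exp(a * x) * (y - x) / a

def mixture(x, y, a, n=20000):
    """Midpoint-rule integral of S_theta(x, y) * exp(a * theta) over [min, max)."""
    lo, hi = min(x, y), max(x, y)
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * step
        total += elementary_score(theta, x, y) * math.exp(a * theta)
    return total * step

x, y, a = 0.5, -0.3, 0.8
gap = abs(mixture(x, y, a) - exponential_bregman(x, y, a))
```

The remaining `gap` is of the order of the discretization error.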

From a theoretical perspective, a natural question is whether the mixture
representations (9) and (11) can be considered Choquet representations in the sense of
functional analysis (Phelps 2001). Recall that a member $S$ of a convex class
$\mathcal{S}$ is an extreme point of $\mathcal{S}$ if it cannot be written as an average
of two other members, i.e., if $S = (S_1 + S_2)/2$ with $S_1, S_2 \in \mathcal{S}$ implies
$S_1 = S_2 = S$. Our mixture representations qualify as Choquet representations if the
elementary scores $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$ form extreme points of the
underlying classes of scoring functions. This cannot possibly be true for our classes
$\mathcal{S}^Q_\alpha$ and $\mathcal{S}^E_\alpha$, because they are invariant under
dilations, hence admit trivial average representations built with multiples of one
and the same scoring function. Therefore, the families $\mathcal{S}^Q_\alpha$ and
$\mathcal{S}^E_\alpha$ need to be restricted suitably. Specifically, let the class
$\mathcal{I}_1$ consist of all functions $g \in \mathcal{I}$ such that
$\lim_{x \to -\infty} g(x) = 0$ and $\lim_{x \to +\infty} g(x) = 1$. Similarly, let
$\mathcal{C}_1$ denote the family of all $\phi \in \mathcal{C}$ such that $\phi(0) = 0$
and $\phi' \in \mathcal{I}_1$. These classes are convex, and so are the associated
subclasses of the families $\mathcal{S}^Q_\alpha$ and $\mathcal{S}^E_\alpha$, which we
denote by $\mathcal{S}^Q_{\alpha,1}$ and $\mathcal{S}^E_{\alpha,1}$, respectively. The
elementary scores $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$ evidently are members of
these restricted families.

Proposition 1a (quantiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the
scoring function $S^Q_{\alpha,\theta}$ is an extreme point of the class
$\mathcal{S}^Q_{\alpha,1}$.

Proposition 1b (expectiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the
scoring function $S^E_{\alpha,\theta}$ is an extreme point of the class
$\mathcal{S}^E_{\alpha,1}$.

We thus have furnished Choquet representations for subclasses of the consistent
scoring functions for quantiles and expectiles. In the extant literature, such Choquet
representations have been known in the binary case only, where $y = 1$ corresponds
to a success and $y = 0$ to a non-success, so that the mean, $p \in [0, 1]$, of the
predictive distribution provides a probability forecast for a success. In this setting,
the Savage representation (7) for the members of the respective class reduces to

$$ S(p, 0) = \tfrac{1}{2} (p \phi'(p) - \phi(p)), \qquad
   S(p, 1) = \tfrac{1}{2} (\phi(1) - \phi(p) - (1 - p) \phi'(p)). $$

The mixture representation (11) can then be written as

$$ S(p, y) = \int_0^1 S^B_\theta(p, y) \, dH(\theta), \qquad (13) $$

where $H$ is a nonnegative measure and

$$ S^B_\theta(p, y) = 2 S^E_{1/2,\theta}(p, y)
   = \begin{cases} \theta, & y = 0, \ p > \theta, \\ 1 - \theta, & y = 1, \ p \le \theta, \\ 0, & \text{otherwise}. \end{cases} \qquad (14) $$

The parameter $\theta \in (0, 1)$ can be interpreted as the cost-loss ratio in the classical
simple cost-loss decision model (Richardson 2012). Up to unimportant conventions
regarding coding, scaling, and gain-loss orientation, this recovers the well known
mixture representation of the proper scoring rules for probability forecasts of binary
events (Shuford, Albert, and Massengill 1966; Schervish 1989). Different choices of
the mixing measure yield the standard examples of scoring rules in this case; see
Buja et al. (2005) and Table 1 in Gneiting and Raftery (2007). The widely used
Brier score,

$$ S(p, 0) = p^2, \qquad S(p, 1) = (1 - p)^2, \qquad (15) $$

arises when $H$ is twice the Lebesgue measure.
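
The statement about the Brier score can be confirmed numerically. The following sketch integrates the elementary score $S^B_\theta$ in (14) against twice the Lebesgue measure on the unit interval and compares the result with (15); the midpoint rule and the function names are assumptions of this illustration.

```python
def binary_elementary_score(theta, p, y):
    """S^B_theta(p, y) as in (14), for a probability forecast p and y in {0, 1}."""
    if y == 0 and p > theta:
        return theta
    if y == 1 and p <= theta:
        return 1.0 - theta
    return 0.0

def brier_via_mixture(p, y, n=20000):
    """Integrate S^B_theta(p, y) against twice the Lebesgue measure on (0, 1)."""
    step = 1.0 / n
    total = sum(binary_elementary_score((k + 0.5) * step, p, y) for k in range(n))
    return 2.0 * step * total

def brier(p, y):
    """Brier score as in (15)."""
    return (p - y) ** 2

# Largest discrepancy over a few forecast probabilities and both outcomes.
largest_gap = max(
    abs(brier_via_mixture(p, y) - brier(p, y))
    for p in (0.0, 0.1, 0.35, 0.5, 0.8, 1.0)
    for y in (0, 1)
)
```

The discrepancy is of the order of the step size of the midpoint rule.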

We close the section by noting a fundamental connection between the extremal
scoring rules for quantiles, expectiles, and probabilities in (10), (12), and (14),
respectively. Specifically, given any predictive CDF, $F$, and outcome, $y \in \mathbb{R}$,

$$ S^Q_{\alpha,\theta}(q_{\alpha,F}, y) = 2 S^E_{1/2,1-\alpha}(1 - F(\theta), \mathbb{1}(y > \theta)) \qquad (16) $$

for every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$. We will revisit this relation
repeatedly.
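
Relation (16) can be checked numerically for a specific predictive distribution. The sketch below uses a normal predictive CDF via `statistics.NormalDist` from the Python standard library and compares both sides over a grid of levels, thresholds, and outcomes. The normal distribution, its parameters, and the grid are assumptions made for illustration.

```python
from statistics import NormalDist

def extremal_quantile_score(theta, x, y, alpha):
    """S^Q_{alpha,theta}(x, y) as in (10)."""
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0

def extremal_expectile_score(theta, x, y, alpha):
    """S^E_{alpha,theta}(x, y) as in (12)."""
    if y <= theta < x:
        return (1.0 - alpha) * abs(y - theta)
    if x <= theta < y:
        return alpha * abs(y - theta)
    return 0.0

predictive = NormalDist(mu=0.5, sigma=2.0)   # an arbitrary predictive CDF F

def both_sides(alpha, theta, y):
    """Left-hand and right-hand side of (16)."""
    quantile = predictive.inv_cdf(alpha)
    left = extremal_quantile_score(theta, quantile, y, alpha)
    probability = 1.0 - predictive.cdf(theta)   # forecast probability of Y > theta
    event = 1.0 if y > theta else 0.0
    right = 2.0 * extremal_expectile_score(1.0 - alpha, probability, event, 0.5)
    return left, right

cases = [
    (a / 10.0, theta, y)
    for a in range(1, 10)
    for theta in (-2.0, -0.3, 0.0, 1.1, 3.0)
    for y in (-2.5, 0.2, 2.0, 4.0)
]
largest_gap = max(abs(left - right) for left, right in (both_sides(*c) for c in cases))
```

Over this grid, the two sides agree up to floating-point rounding.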


2.3 Economic interpretation 

Our results in the previous section give rise to natural economic interpretations of
the extremal scoring functions $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$, along with
the quantile and expectile functionals themselves. In either case, the interpretation
relates to a binary betting or investment decision with random outcome, $y$.

In the case of the extremal quantile scoring function in (10), the payoff takes
on only two possible values, relating to a bet on whether or not the outcome $y$ will
exceed the threshold $\theta$. Specifically, consider the following payoff scheme, which is
realized in spread betting in prediction markets (Wolfers and Zitzewitz 2008):

• If Quinn refrains from betting, his payoff will be zero, independently of the
outcome $y$.

• If Quinn enters the bet and $y \le \theta$ realizes, he loses his wager, $\rho_L > 0$.

• If Quinn enters the bet and $y > \theta$ realizes, his winnings are $\rho_G > \rho_L$, for a
gain of $\rho_G - \rho_L$.


How should Quinn act under this payoff scheme? If Quinn does not enter the
bet, his actual and expected payoffs equal zero. If he does enter, his expected payoff
is

$$ -\rho_L \int_{-\infty}^{\theta} dF(y) + (\rho_G - \rho_L) \int_{\theta}^{\infty} dF(y), $$

where $F$ is Quinn’s predictive CDF for the future outcome, $y$, which for simplicity
we assume to be strictly increasing. This expression is strictly positive if and only
if $q_{\alpha,F} > \theta$, where

$$ \alpha = \frac{\rho_G - \rho_L}{\rho_G} \in (0, 1). \qquad (17) $$

Hence, Quinn’s optimal decision rule is determined by the $\alpha$-quantile of $F$, in that
he enters the bet if and only if $q_{\alpha,F} > \theta$. Motivated by the specific format of
the optimal decision or Bayes rule, the top left matrix in Table 1 summarizes the payoff
from any strategy of the form enter the bet if and only if $x > \theta$.

It remains to draw the connection to the extremal scoring function
$S^Q_{\alpha,\theta}$. To this end, we shift attention from positively oriented payoffs to
negatively oriented regrets, which we define as the difference between the payoff for an
oracle and Quinn’s payoff. Here the term oracle refers to a (hypothetical) omniscient
bettor who enters the bet if and only if $y > \theta$ realizes, which would yield an ideal
payoff of $\rho_G - \rho_L$ if $y > \theta$, and zero otherwise. If Quinn uses some decision
threshold $x$, his regret equals the extremal score $S^Q_{\alpha,\theta}(x, y)$ except for an
irrelevant multiplicative factor. This is illustrated in the bottom left matrix in the
table and corresponds to the classical, simple cost-loss decision model (Richardson
2012). In decision theoretic terms, the distinction between payoff and regret is
inessential, because the difference depends on the outcome, $y$, only. In either case,
the optimal strategy is to choose the decision threshold $x = q_{\alpha,F}$.
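
The betting interpretation can be traced in a few lines of code. The sketch below encodes Quinn's payoff and regret for the rule "enter the bet if and only if $x > \theta$" and compares the regret with $\rho_G$ times the extremal score in (10). The stakes `rho_l = 1` and `rho_g = 4`, the threshold, and the function names are hypothetical.

```python
def payoff(x, y, theta, rho_l, rho_g):
    """Quinn's payoff when he enters the bet if and only if x > theta."""
    if x <= theta:
        return 0.0
    return rho_g - rho_l if y > theta else -rho_l

def regret(x, y, theta, rho_l, rho_g):
    """Oracle payoff minus Quinn's payoff; the oracle bets only if y > theta."""
    oracle = rho_g - rho_l if y > theta else 0.0
    return oracle - payoff(x, y, theta, rho_l, rho_g)

def extremal_quantile_score(theta, x, y, alpha):
    """S^Q_{alpha,theta}(x, y) as in (10)."""
    if y <= theta < x:
        return 1.0 - alpha
    if x <= theta < y:
        return alpha
    return 0.0

rho_l, rho_g, theta = 1.0, 4.0, 0.0     # hypothetical wager, winnings, threshold
alpha = (rho_g - rho_l) / rho_g         # level (17)

grid = [-1.5, -0.5, 0.0, 0.5, 1.5]
largest_gap = max(
    abs(regret(x, y, theta, rho_l, rho_g)
        - rho_g * extremal_quantile_score(theta, x, y, alpha))
    for x in grid
    for y in grid
)
```

In this sketch the multiplicative factor relating regret and extremal score is $\rho_G$.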

In the case of the extremal expectile scoring function in (12), the payoff is
real-valued. Specifically, suppose that Eve considers investing a fixed amount $\theta$
into a start-up company, in exchange for an unknown, future amount $y$ of the company’s
profits or losses. The payoff structure then is as follows:

• If Eve refrains from the deal, her payoff will be zero, independently of the
outcome $y$.

• If Eve invests and $y \le \theta$ realizes, her payoff is negative, at
$-(1 - \kappa_L)(\theta - y)$. Here, $\theta - y$ is the sheer monetary loss, and the factor
$1 - \kappa_L$ accounts for Eve’s reduction in income tax, with $\kappa_L \in [0, 1)$
representing the deduction rate.¹

• If Eve invests and $y > \theta$ realizes, her payoff is positive, at
$(1 - \kappa_G)(y - \theta)$, where $\kappa_G \in [0, 1)$ denotes the tax rate that applies
to her profits.

                   Quantiles                               Expectiles

Monetary payoff    $y \le \theta$   $y > \theta$           $y \le \theta$                    $y > \theta$
$x \le \theta$     $0$              $0$                    $0$                               $0$
$x > \theta$       $-\rho_L$        $\rho_G - \rho_L$      $-(1 - \kappa_L)(\theta - y)$     $(1 - \kappa_G)(y - \theta)$

Score (regret)     $y \le \theta$   $y > \theta$           $y \le \theta$                    $y > \theta$
$x \le \theta$     $0$              $\rho_G - \rho_L$      $0$                               $(1 - \kappa_G)(y - \theta)$
$x > \theta$       $\rho_L$         $0$                    $(1 - \kappa_L)(\theta - y)$      $0$

Table 1: Overview of payoff structures for decision rules of the form enter the bet/invest if
and only if $x > \theta$. Monetary payoffs are positively oriented, whereas scores are
negatively oriented regrets relative to an oracle. In the left column, the regret equals the
extremal score $S^Q_{\alpha,\theta}(x, y)$, where $\alpha = (\rho_G - \rho_L)/\rho_G$, up to a
multiplicative factor. In the right column, the regret is $S^E_{\alpha,\theta}(x, y)$, where
$\alpha = (1 - \kappa_G)/(2 - \kappa_G - \kappa_L)$, again up to a multiplicative factor.


How should Eve act under this payoff scheme? If Eve does not enter the deal,
her actual and expected payoffs vanish. In case she invests, the expected payoff is

$$ -(1 - \kappa_L) \int_{-\infty}^{\theta} (\theta - y) \, dF(y) + (1 - \kappa_G) \int_{\theta}^{\infty} (y - \theta) \, dF(y). $$

This expression is strictly positive if and only if the expectile at level

$$ \alpha = \frac{1 - \kappa_G}{2 - \kappa_G - \kappa_L} \in (0, 1) \qquad (18) $$

of Eve’s predictive CDF, $F$, exceeds $\theta$. In analogy to the quantile case, the top
right matrix in Table 1 represents Eve’s payoff from any strategy of the form
invest if and only if $x > \theta$.

¹ In financial terms, the loss acts as a tax shield. The linear functional form assumed here is
not unrealistic, even though it is simpler than many real-world tax schemes, where nonlinearities
may arise from tax exemptions, progression, etc.
To relate to the extremal scoring function $S^E_{\alpha,\theta}$, we again shift attention
to regrets relative to an omniscient investor or oracle who enters the deal if and only
if $y > \theta$ occurs, which would yield the ideal payoff $(1 - \kappa_G)(y - \theta)_+$. As
seen in the table, if Eve uses the threshold $x$ to determine whether or not to invest,
the regret equals the extremal score $S^E_{\alpha,\theta}(x, y)$, up to a multiplicative
factor.²

Therefore, expectiles can be interpreted as optimal decision thresholds in
investment problems with fixed costs and differential tax rates for profits versus losses.
The mean arises in the special case when $\alpha = 1/2$ in (18). It corresponds to
situations in which losses are fully tax deductible ($\kappa_G = \kappa_L$) and nests
situations without taxes ($\kappa_G = \kappa_L = 0$). Tough taxation settings where
$\kappa_L < \kappa_G$ shift Eve’s incentives toward not entering the deal and correspond to
expectiles at levels $\alpha < 1/2$. For example, if losses cannot be deducted at all
($\kappa_L = 0$), whereas profits are taxed at a rate of $\kappa_G = 1/2$, Eve will invest
only if the expectile at level $\alpha = 1/3$ of her predictive CDF, $F$, exceeds the deal’s
fixed costs, $\theta$. Note that we permit the case $\theta < 0$, which may reflect subsidies
or tax credits, say.
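
The investment interpretation admits an analogous check. The sketch below computes the level (18) from the two tax rates, encodes Eve's payoff and regret for the rule "invest if and only if $x > \theta$", and compares the regret with $(2 - \kappa_G - \kappa_L)$ times the extremal score in (12). The numerical values and function names are hypothetical.

```python
def expectile_level(kappa_g, kappa_l):
    """Level alpha in (18) implied by the tax rates on gains and losses."""
    return (1.0 - kappa_g) / (2.0 - kappa_g - kappa_l)

def payoff(x, y, theta, kappa_g, kappa_l):
    """Eve's after-tax payoff when she invests if and only if x > theta."""
    if x <= theta:
        return 0.0
    if y > theta:
        return (1.0 - kappa_g) * (y - theta)
    return -(1.0 - kappa_l) * (theta - y)

def regret(x, y, theta, kappa_g, kappa_l):
    """Oracle payoff minus Eve's payoff; the oracle invests only if y > theta."""
    oracle = (1.0 - kappa_g) * max(y - theta, 0.0)
    return oracle - payoff(x, y, theta, kappa_g, kappa_l)

def extremal_expectile_score(theta, x, y, alpha):
    """S^E_{alpha,theta}(x, y) as in (12)."""
    if y <= theta < x:
        return (1.0 - alpha) * abs(y - theta)
    if x <= theta < y:
        return alpha * abs(y - theta)
    return 0.0

kappa_g, kappa_l, theta = 0.5, 0.0, 1.0   # gains taxed at 1/2, losses not deductible
alpha = expectile_level(kappa_g, kappa_l)
factor = 2.0 - kappa_g - kappa_l          # multiplicative factor between regret and score

grid = [-2.0, 0.0, 1.0, 1.5, 3.0]
largest_gap = max(
    abs(regret(x, y, theta, kappa_g, kappa_l)
        - factor * extremal_expectile_score(theta, x, y, alpha))
    for x in grid
    for y in grid
)
```

For these tax rates the level evaluates to 1/3, matching the example in the text.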

The above interpretation of expectiles as optimal thresholds in decision problems 
attaches an economic meaning to this class of functionals, which thus far seems to 
have been missing; e.g., Schulze Waltrup et al. (2014, p. 2) note that “expectiles
lack an intuitive interpretation”. The foregoing may also bear on the debate about 
the revision of the Basel protocol for banking regulation, which involves contention 
about the choice of the functional of in-house risk distributions that banks are 
supposed to report to regulators (Embrechts et al. 2014). Recently, expectiles have 
been put forth as potential candidates, as it has been proved that they are the only 
elicitable law-invariant coherent risk measures (Delbaen et al. 2014; Ziegel 2014; 
Bellini and Bignozzi 2015). 

2.4 Order sensitivity 

The extremal scoring functions $S^Q_{\alpha,\theta}$ and $S^E_{\alpha,\theta}$ are not only
consistent for their respective functional, they in fact enjoy the stronger property of
order sensitivity. Generally, a scoring function $S$ is order sensitive for the
functional $F \mapsto T(F)$ relative to the class $\mathcal{F}$ if, for all
$F \in \mathcal{F}$, all $t \in T(F)$, and all $x_1, x_2 \in \mathbb{R}$,

$$ x_2 \le x_1 \le t \implies \mathbb{E}_F[S(x_2, Y)] \ge \mathbb{E}_F[S(x_1, Y)] $$

and

$$ t \le x_1 \le x_2 \implies \mathbb{E}_F[S(x_1, Y)] \le \mathbb{E}_F[S(x_2, Y)]. $$

The order sensitivity is strict if the above continues to hold when the inequalities
involving $x_1$ and $x_2$ are strict. As before, we denote the class of the Borel
probability measures on $\mathbb{R}$ by $\mathcal{F}_0$, and we write $\mathcal{F}_1$ for
the subclass of the probability measures with finite first moment.

² The elementary score $S^B_\theta$ for probability forecasts of a binary event in (14) is
obtained when $\kappa_G = \kappa_L = 0$ and $y \in \{0, 1\}$. The parameter
$\theta \in (0, 1)$ can then be interpreted as a cost-loss ratio.
Proposition 2a (quantiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the
extremal scoring function $S^Q_{\alpha,\theta}$ is order sensitive for the $\alpha$-quantile
functional relative to $\mathcal{F}_0$.

Proposition 2b (expectiles). For every $\alpha \in (0, 1)$ and $\theta \in \mathbb{R}$, the
extremal scoring function $S^E_{\alpha,\theta}$ is order sensitive for the
$\alpha$-expectile functional relative to $\mathcal{F}_1$.

Owing to the mixture representations (9) and (11), the order sensitivity of the
extremal scoring functions transfers to all regular consistent scoring functions. Strict
order sensitivity applies if the function $g$ in the representation (5) and the
derivative $\phi'$ in the representation (7), respectively, are strictly increasing,
relative to subclasses of probability measures with suitable moment constraints. Closely
related results have recently been obtained in studies of elicitability (Steinwart et
al. 2014; Ziegel 2014; Bellini and Bignozzi 2015). In this strand of literature, the
ambitious goal of characterizing all elicitable functionals necessitates regularity
conditions that are not satisfied by our discontinuous, compactly supported extremal
scoring functions.
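
Order sensitivity of the extremal quantile score can be illustrated in closed form. For a continuous predictive CDF $F$, definition (10) gives an expected score of $(1 - \alpha) F(\theta)$ if $\theta < x$ and $\alpha (1 - F(\theta))$ otherwise. The sketch below evaluates this for a standard normal $F$ and checks monotonicity on either side of the $\alpha$-quantile; the distribution, the level, and the grids are assumptions made for illustration.

```python
from statistics import NormalDist

def expected_extremal_quantile_score(x, theta, alpha, dist):
    """E_F[S^Q_{alpha,theta}(x, Y)] for a continuous predictive CDF F."""
    prob_below = dist.cdf(theta)            # P(Y <= theta)
    if theta < x:
        return (1.0 - alpha) * prob_below   # penalty applies when Y <= theta < x
    return alpha * (1.0 - prob_below)       # penalty applies when x <= theta < Y

dist = NormalDist()                         # standard normal predictive distribution
alpha = 0.7
target = dist.inv_cdf(alpha)

# Ordered point forecasts with the alpha-quantile in the middle (index 12).
forecasts = [target + 0.25 * k for k in range(-12, 13)]

def is_order_sensitive(theta):
    scores = [expected_extremal_quantile_score(x, theta, alpha, dist) for x in forecasts]
    left, right = scores[:13], scores[12:]
    non_increasing = all(a >= b for a, b in zip(left, left[1:]))
    non_decreasing = all(a <= b for a, b in zip(right, right[1:]))
    return non_increasing and non_decreasing

checks = [is_order_sensitive(-3.0 + 0.1 * k) for k in range(61)]
```

Every threshold on the grid passes the check: the expected score does not increase as the forecast approaches the quantile from either side.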


3 Forecast rankings 

In this section, we turn to the task of comparing and ranking forecasts. Before
applying our mixture representations to this problem, we introduce the prediction
space setting of Gneiting and Ranjan (2013) and define notions of forecast
dominance.


3.1 Prediction spaces 


A prediction space is a probability space tailored to the study of forecasting
problems. Following the seminal work of Murphy and Winkler (1987), the prediction
space setting of Gneiting and Ranjan (2013) considers the joint distribution of
forecasts and observations. We first focus on probabilistic forecasts, $F$, which we
identify with the associated cumulative distribution functions (CDFs) for the
real-valued outcome, $Y$. The elements of the respective sample space $\Omega$ can be
identified with tuples of the form

$$ (F_1, \ldots, F_k, Y), \qquad (19) $$

where the predictive distributions $F_1, \ldots, F_k$ utilize information sets
$\mathcal{A}_1, \ldots, \mathcal{A}_k \subseteq \mathcal{A}$, respectively, with
$\mathcal{A}$ being a sigma field on the sample space $\Omega$. In measure
theoretic language, the information sets correspond to sub sigma fields, and $F_j$ is a
CDF-valued random quantity measurable with respect to $\mathcal{A}_j$. The joint
distribution of the quantities in (19) is encoded by a probability measure $\mathbb{Q}$
on $(\Omega, \mathcal{A})$. In this setting, a predictive distribution $F_j$ is ideal
relative to $\mathcal{A}_j$ if it corresponds to the conditional distribution of the
outcome $Y$ under $\mathbb{Q}$ given $\mathcal{A}_j$.

In a nutshell, a prediction space specifies the joint distribution of tuples of the
form (19). To give an example, Table 2 revisits a scenario studied by Gneiting
et al. (2007) and Gneiting and Ranjan (2013). Here, the outcome is generated
as $Y \mid \mu \sim \mathcal{N}(\mu, 1)$ where $\mu \sim \mathcal{N}(0, 1)$. The perfect
forecaster is ideal relative to the sigma field generated by the random variable $\mu$.
The unfocused and sign-reversed forecasters also have knowledge of $\mu$, but fail to be
ideal. The climatological forecaster, issuing the unconditional distribution of the
outcome $Y$ as predictive distribution, is ideal relative to the uninformative sigma
field generated by the empty set.

Forecaster       Predictive distribution                                           $\alpha$-quantile          Mean             $\mathrm{Prob}(Y > y)$
Perfect          $\mathcal{N}(\mu, 1)$                                             $\mu + z_\alpha$           $\mu$            $1 - \Phi(y - \mu)$
Climatological   $\mathcal{N}(0, 2)$                                               $\sqrt{2} \, z_\alpha$     $0$              $1 - \Phi(y / \sqrt{2})$
Unfocused        $\frac{1}{2} (\mathcal{N}(\mu, 1) + \mathcal{N}(\mu + \tau, 1))$  $\mu + z_{\alpha,\tau}$    $\mu + \tau/2$   $1 - \Phi_\tau(y - \mu)$
Sign-reversed    $\mathcal{N}(-\mu, 1)$                                            $-\mu + z_\alpha$          $-\mu$           $1 - \Phi(y + \mu)$

Table 2: An example of a prediction space with four competing forecasters. The outcome
is generated as $Y \mid \mu \sim \mathcal{N}(\mu, 1)$, where $\mu \sim \mathcal{N}(0, 1)$.
The random variable $\tau$ attains the values $-2$ and $2$ with probability $1/2$,
independently of $\mu$ and $Y$. For $\alpha \in (0, 1)$ and $\tau \in \{-2, 2\}$, we let
$z_\alpha = \Phi^{-1}(\alpha)$, $\Phi_\tau(x) = (\Phi(x) + \Phi(x - \tau))/2$, and
$z_{\alpha,\tau} = \Phi_\tau^{-1}(\alpha)$, where $\Phi$ denotes the CDF of the standard
normal distribution.

Any predictive distribution, $F$, can be reduced to a point forecast by extracting the sought functional, $T(F)$. In what follows, we focus on quantiles, the mean or expectation functional, and probability forecasts of the binary event that the outcome exceeds a threshold value. The respective point forecasts for the perfect, climatological, unfocused, and sign-reversed forecaster are shown in Table 2.
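The reduction from predictive distributions to point forecasts in Table 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the paper: the function names (`phi_tau`, `z_alpha_tau`, `point_forecasts`) are hypothetical, and the mixture quantile $z_{\alpha,\tau}$ is obtained by simple bisection, which is one of several reasonable choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; provides cdf and inv_cdf


def phi_tau(x, tau):
    """Mixture CDF Phi_tau(x) = (Phi(x) + Phi(x - tau)) / 2 from Table 2."""
    return 0.5 * (STD_NORMAL.cdf(x) + STD_NORMAL.cdf(x - tau))


def z_alpha_tau(alpha, tau, lo=-20.0, hi=20.0, iters=200):
    """Invert Phi_tau by bisection (it is continuous and strictly increasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi_tau(mid, tau) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def point_forecasts(mu, tau, alpha, y):
    """Return (alpha-quantile, mean, Prob(Y > y)) for each forecaster in Table 2."""
    z = STD_NORMAL.inv_cdf(alpha)
    root2 = 2.0 ** 0.5
    return {
        "perfect": (mu + z, mu, 1.0 - STD_NORMAL.cdf(y - mu)),
        "climatological": (root2 * z, 0.0, 1.0 - STD_NORMAL.cdf(y / root2)),
        "unfocused": (mu + z_alpha_tau(alpha, tau), mu + tau / 2.0,
                      1.0 - phi_tau(y - mu, tau)),
        "sign_reversed": (-mu + z, -mu, 1.0 - STD_NORMAL.cdf(y + mu)),
    }
```

For instance, `point_forecasts(mu=0.5, tau=2.0, alpha=0.9, y=2.0)` returns the three point forecasts of each forecaster for one realization of $(\mu, \tau)$ and the threshold $y = 2$.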

In practice, point forecasts might be an end in themselves, i.e., they might have been issued without there being an underlying predictive distribution. To accommodate such cases, we define a point prediction space to be a probability space $(\Omega, \mathcal{A}, \mathbb{Q})$, where the elements of the sample space $\Omega$ can be identified with tuples of the form

$$(X_1, \ldots, X_\ell, Y), \tag{20}$$

where the random variables $X_1, \ldots, X_\ell$ represent point forecasts and utilize information sets $\mathcal{A}_1, \ldots, \mathcal{A}_\ell \subseteq \mathcal{A}$, respectively.⁴ The joint distribution of the point forecasts and the observation in (20) is specified by the probability measure $\mathbb{Q}$. Similarly, it is sometimes useful to consider a mixed prediction space, by specifying the joint distribution $\mathbb{Q}$ of tuples of the form

$$(F_1, \ldots, F_k, X_1, \ldots, X_\ell, Y), \tag{21}$$

where $F_1, \ldots, F_k$ represent CDF-valued random quantities, and $X_1, \ldots, X_\ell$ represent point forecasts.

³The only difference is that we let the random variable $\tau$ attain the values $-2$ and $2$, rather than the values $-1$ and $1$ as in Gneiting et al. (2007) and Gneiting and Ranjan (2013).

⁴For simplicity, we let $X_1, \ldots, X_\ell$ be single-valued. Extensions to set-valued random quantities, as might occur in the case of quantiles, are straightforward.


3.2 Notions of forecast dominance 


We now define notions of forecast dominance, starting with probabilistic forecasts that take the form of predictive CDFs, and then turning to point forecasts. In the former setting, a scoring rule $S(F, y)$ is a suitably measurable function that assigns a loss or penalty when we issue the predictive distribution $F$ and $y$ realizes. A scoring rule $S$ is proper if

$$\mathbb{E}_G\, S(G, Y) \le \mathbb{E}_G\, S(F, Y) \tag{22}$$

for all probability measures $F$ and $G$ in its domain of definition (Gneiting and Raftery 2007). Proper scoring rules therefore encourage honest and careful assessments. As is well known, a scoring function $S$ that is consistent for a single-valued functional $T$ relative to a class $\mathcal{F}$ induces a proper scoring rule, by defining $S(F, y) = S(T(F), y)$ for $F \in \mathcal{F}$ and $y \in \mathbb{R}$.
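As a worked instance of this construction (a standard calculation added here for illustration, not taken from the paper), consider the squared error scoring function $S(x, y) = (x - y)^2$ together with the mean functional $T$, assuming that all distributions involved have finite second moments. The induced scoring rule then satisfies the propriety inequality (22):

```latex
\begin{align*}
\mathbb{E}_G\, S(F, Y)
  &= \mathbb{E}_G \bigl( T(F) - Y \bigr)^2 \\
  &= \bigl( T(F) - T(G) \bigr)^2 + \operatorname{Var}_G(Y) \\
  &\ge \operatorname{Var}_G(Y) \;=\; \mathbb{E}_G\, S(G, Y).
\end{align*}
```

Equality holds whenever $T(F) = T(G)$, which illustrates that the induced rule is proper but not strictly proper: it cannot distinguish predictive distributions that share the same mean.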

Definition 1 (predictive CDFs). Let $F_1$ and $F_2$ be probabilistic forecasts, and let $Y$ be the outcome, in a prediction space. Then $F_1$ dominates $F_2$ relative to a class $\mathcal{P}$ of proper scoring rules if $\mathbb{E}_{\mathbb{Q}}\, S(F_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S(F_2, Y)$ for every $S \in \mathcal{P}$.

We now turn to quantiles and expectiles and the respective families $\mathcal{S}^{Q}_{\alpha}$ and $\mathcal{S}^{E}_{\alpha}$ of the regular consistent scoring functions for these functionals.


Definition 2a (quantiles). Let $X_1$ and $X_2$ be point forecasts, and let $Y$ be the outcome, in a point prediction space. Then $X_1$ dominates $X_2$ as an $\alpha$-quantile forecast if $\mathbb{E}_{\mathbb{Q}}\, S(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S(X_2, Y)$ for every scoring function $S \in \mathcal{S}^{Q}_{\alpha}$.

Definition 2b (expectiles). Let $X_1$ and $X_2$ be point forecasts, and let $Y$ be the outcome, in a point prediction space. Then $X_1$ dominates $X_2$ as an $\alpha$-expectile forecast if $\mathbb{E}_{\mathbb{Q}}\, S(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S(X_2, Y)$ for every scoring function $S \in \mathcal{S}^{E}_{\alpha}$.


It is important to note that the expectations in the definitions are taken with respect to the joint distribution of the probabilistic forecasts and the outcome. The notions provide partial orderings for the predictive distributions $F_1, \ldots, F_k$ in (19) and $X_1, \ldots, X_\ell$ in (20), respectively.⁵ Essentially, a probabilistic forecast that dominates another is preferable, or at least not inferior, in any type of decision that involves the respective predictive distributions. In the case of quantiles or expectiles, a point forecast that dominates another is preferable, or at least not inferior, in any type of decision problem that depends on the respective predictive distributions via the considered functional only. Adaptations to functionals other than quantiles or expectiles are straightforward.


⁵In the special case of probability forecasts of a binary event, related notions of sufficiency and dominance have been studied by DeGroot and Fienberg (1983), Vardeman and Meeden (1983), Schervish (1989), Krämer (2005), and Bröcker (2009).

Under which conditions does a forecast dominate another? Holzmann and Eulert (2014) recently showed that if two predictive distributions are ideal, then the one with the richer information set dominates the other. Furthermore, the result carries over to ideal forecasters' induced point predictions, including but not limited to the cases of quantiles and expectiles that we consider here. To give an example in the setting of Table 2, the perfect and the climatological forecasters are ideal relative to the sigma fields generated by $\mu$, and generated by the empty set, respectively. Therefore, the perfect forecaster dominates the climatological forecaster, in any of the above senses.

Tsyplakov (2014) went on to show that if a predictive distribution is ideal relative to a certain information set, then it dominates any predictive distribution that is measurable with respect to the information set. Again, the result carries over to the induced point forecasts. In the setting of Table 2, the perfect forecaster is ideal relative to the sigma field generated by the random variables $\mu$ and $\tau$. The climatological, unfocused, and sign-reversed forecasters are measurable with respect to this sigma field, and so they are dominated by the perfect forecaster, in any of the above senses.

In the practice of forecasting, predictive distributions are hardly ever ideal, and information sets may not be nested, as emphasized by Patton (2015). Therefore, the above theoretical results are not readily applicable, and distinct scoring rules, or distinct consistent scoring functions, may yield distinct forecast rankings, as in empirical examples given by Schervish (1989), Merkle and Steyvers (2013), and Patton (2015), among others. Furthermore, in general it is not feasible to check the validity of the expectation inequalities in Definitions 1, 2a, and 2b for every proper scoring rule $S \in \mathcal{P}$, or consistent scoring function $S \in \mathcal{S}^{Q}_{\alpha}$, or $S \in \mathcal{S}^{E}_{\alpha}$, respectively.

Fortunately, in the case of quantile and expectile forecasts, the mixture representations in Theorems 1a and 1b reduce checks for dominance to the respective one-dimensional families of elementary scoring functions.


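Corollary 1a (quantiles). In a point prediction space, $X_1$ dominates $X_2$ as an $\alpha$-quantile forecast if $\mathbb{E}_{\mathbb{Q}}\, S^{Q}_{\alpha,\theta}(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S^{Q}_{\alpha,\theta}(X_2, Y)$ for every $\theta \in \mathbb{R}$.


Corollary 1b (expectiles). In a point prediction space, $X_1$ dominates $X_2$ as an $\alpha$-expectile forecast if $\mathbb{E}_{\mathbb{Q}}\, S^{E}_{\alpha,\theta}(X_1, Y) \le \mathbb{E}_{\mathbb{Q}}\, S^{E}_{\alpha,\theta}(X_2, Y)$ for every $\theta \in \mathbb{R}$.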


The reduction to a one-dimensional problem suggests graphical comparisons via Murphy diagrams. Before we discuss this tool, we note that order sensitivity can sometimes be invoked to prove dominance. For example, consider the mixed prediction space setting (21) with $k = 1$ and $\ell = 2$. Suppose that the CDF-valued random quantity $F$ is ideal relative to the sigma field $\mathcal{A}$, and let $q_{\alpha,F}$ denote its $\alpha$-quantile. Suppose furthermore that $X_1$ and $X_2$ are measurable with respect to $\mathcal{A}$. By Corollary 1a in concert with Proposition 1a and a conditioning argument, $X_1$ dominates $X_2$ as an $\alpha$-quantile forecast if with probability one either

$$X_2 \le X_1 \le q_{\alpha,F} \quad \text{or} \quad q_{\alpha,F} \le X_1 \le X_2$$

holds true. An analogous argument applies in the case of the $\alpha$-expectile.

In the scenario of Table 2, the argument can be put to work in the case $\alpha = 1/2$ that corresponds to median and mean forecasts, respectively. Specifically, let $F$ be the perfect forecast, which has median and mean $\mu$, let $\mathcal{A}$ be the sigma field generated by $\mu$, and let $X_1 = 0$ and $X_2 = -\mu$. Invoking the order sensitivity argument, we see that the climatological forecaster dominates the sign-reversed forecaster for both median and mean predictions.


3.3 The Murphy diagram as a diagnostic tool 

As noted, Corollaries 1a and 1b suggest graphical tools for the comparison of quantile and expectile forecasts, including the special cases of the mean or expectation functional, and the further special case of probability forecasts of a binary event. We describe these diagnostic tools in the setting of a point prediction space (20), where $X_1, \ldots, X_\ell$ denote point forecasts for the outcome $Y$, and the probability measure $\mathbb{Q}$ represents their joint distribution. In the case of probability forecasts, we use the more suggestive notation $p_1, \ldots, p_\ell$ for the forecasts.


• For quantile forecasts at level $\alpha \in (0, 1)$, we plot the graph of the expected elementary quantile score $S^{Q}_{\alpha,\theta}$,

$$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{Q}_{\alpha,\theta}(X_j, Y), \tag{23}$$

for $j = 1, \ldots, \ell$. By Corollary 1a, forecast $X_i$ dominates forecast $X_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in \mathbb{R}$. The area under $s_j(\theta)$ equals the respective expected asymmetric piecewise linear score.

• For expectile forecasts at level $\alpha \in (0, 1)$, we plot the graph of the expected elementary expectile score $S^{E}_{\alpha,\theta}$,

$$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{E}_{\alpha,\theta}(X_j, Y), \tag{24}$$

for $j = 1, \ldots, \ell$. By Corollary 1b, forecast $X_i$ dominates forecast $X_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in \mathbb{R}$. The area under $s_j(\theta)$ equals half the respective expected asymmetric squared error (8).

• For probability forecasts of a binary event, we plot the graph of the expected elementary score $S^{B}_{\theta}$,

$$\theta \mapsto s_j(\theta) = \mathbb{E}_{\mathbb{Q}}\, S^{B}_{\theta}(p_j, Y), \tag{25}$$

for $j = 1, \ldots, \ell$. By Corollary 1b, the probability forecast $p_i$ dominates $p_j$ if and only if $s_i(\theta) \le s_j(\theta)$ for $\theta \in (0, 1)$. The area under $s_j(\theta)$ equals half the expected Brier score (15).

Figure 2: Murphy diagrams for the forecasters in Table 2. The functionals considered are the mean, the quantile at level $\alpha = 0.90$, and the probability of the binary event $Y > 2$ (panel titles: Mean; Quantile ($\alpha = 0.90$); Probability ($Y > 2$); horizontal axis: parameter $\theta$). The vertical dashed lines in the bottom panels indicate the two extremal scores that relate to each other as in (14) and (16).
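The area statements in the three bullet points can be checked numerically for a single forecast and outcome pair. The sketch below assumes that the elementary scores take the forms $S^{Q}_{\alpha,\theta}(x, y) = (\mathbb{1}(y < x) - \alpha)(\mathbb{1}(\theta < x) - \mathbb{1}(\theta < y))$, $S^{E}_{\alpha,\theta}(x, y) = |\mathbb{1}(y < x) - \alpha| \, ((y - \theta)_+ - (x - \theta)_+ - (y - x)\mathbb{1}(\theta < x))$, and $S^{B}_{\theta}(p, y) = \theta$ if $y = 0$ and $p > \theta$, $1 - \theta$ if $y = 1$ and $p \le \theta$, and $0$ otherwise, which is our reading of the definitions in Section 2. The function names and the numerical values are illustrative.

```python
def s_quantile(theta, x, y, alpha):
    """Elementary quantile score: (1{y<x} - alpha) * (1{theta<x} - 1{theta<y})."""
    return ((y < x) - alpha) * ((theta < x) - (theta < y))


def s_expectile(theta, x, y, alpha):
    """Elementary expectile score with weight |1{y<x} - alpha|."""
    weight = abs((y < x) - alpha)
    return weight * (max(y - theta, 0.0) - max(x - theta, 0.0)
                     - (y - x) * (theta < x))


def s_binary(theta, p, y):
    """Elementary score for a probability forecast p of a binary outcome y."""
    if y == 0 and p > theta:
        return theta
    if y == 1 and p <= theta:
        return 1.0 - theta
    return 0.0


def area(score, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of theta -> score(theta)."""
    h = (hi - lo) / steps
    return h * sum(score(lo + (k + 0.5) * h) for k in range(steps))


x, y, alpha = 2.0, 0.5, 0.9                      # illustrative forecast, outcome, level
pinball = ((y < x) - alpha) * (x - y)            # asymmetric piecewise linear score
asym_sq = abs((y < x) - alpha) * (x - y) ** 2    # asymmetric squared error
area_q = area(lambda t: s_quantile(t, x, y, alpha), 0.0, 3.0)
area_e = area(lambda t: s_expectile(t, x, y, alpha), 0.0, 3.0)
area_b = area(lambda t: s_binary(t, 0.3, 1), 0.0, 1.0)
brier = (0.3 - 1) ** 2                           # Brier score for p = 0.3, y = 1
```

Up to discretization error, `area_q` agrees with `pinball`, `area_e` with `asym_sq / 2`, and `area_b` with `brier / 2`. Averaging over forecast cases (or taking expectations) carries these identities over to the curves $s_j(\theta)$.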


In the context of probability forecasts for binary weather events, displays of this type have a rich tradition that can be traced to Thompson and Brier (1955) and Murphy (1977). More recent examples include the papers by Schervish (1989), Richardson (2000), Wilks (2001), Mylne (2002), and Berrocal et al. (2010), among many others. Murphy (1977) distinguished three kinds of diagrams that reflect the economic decisions involved. The negatively oriented expense diagram shows the mean raw loss or expense of a given forecast scheme; the positively oriented value diagram takes the unconditional or climatological forecast as reference and plots the difference in expense between this reference forecast and the forecast at hand; and lastly, the relative-value diagram plots the ratio of the utility of a given forecast and the utility of an oracle forecast. The displays introduced above are similar to the value diagrams of Murphy, and we refer to them as Murphy diagrams. Our Murphy diagrams are by default negatively oriented and plot the expected elementary score for competing quantile, expectile, and probability forecasters. For better visual appearance, we generally connect the left- and right-hand limits at the jump points of the empirical score curves.

Figure 2 shows Murphy diagrams for the perfect, climatological, unfocused, and sign-reversed forecasters in Table 2. We compare point predictions for the mean or expectation functional, and the quantile at level $\alpha = 0.90$, along with probability forecasts for the binary event that the outcome exceeds the threshold value 2. Analytic expressions for the respective expected scores are given in Appendix B. As proved in the previous section, the perfect forecaster dominates the other forecasters for all functionals considered. The expected score curves for the climatological and the unfocused, and for the unfocused and the sign-reversed forecasters, intersect in all three cases, so there are no order relations between these forecasters. Finally, the Murphy diagrams suggest that the climatological forecaster dominates the sign-reversed forecaster for all three functionals, and in the case of the mean functional, the order sensitivity argument in the previous section confirms the visual impression. In the cases of the quantile and probability forecasts, final confirmation would need to be based on tedious analytic investigations of the asymptotic behavior of the expected score functions.
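The Murphy curves for the mean functional in this scenario can be approximated by simulation. The following sketch is a Monte Carlo illustration under stated assumptions, not a reproduction of the analytic expressions in Appendix B: it assumes the elementary score for the mean is $S^{E}_{1/2,\theta}$ in the form $\frac{1}{2}((y - \theta)_+ - (x - \theta)_+ - (y - x)\mathbb{1}(\theta < x))$, and the sample size, seed, and grid of $\theta$ values are arbitrary choices.

```python
import random


def s_mean(theta, x, y):
    """Elementary score for the mean (expectile at level 1/2); form assumed from Section 2."""
    return 0.5 * (max(y - theta, 0.0) - max(x - theta, 0.0)
                  - (y - x) * (theta < x))


def simulate(n, seed=1):
    """Draw (mu, tau, Y) as in Table 2; return the mean forecasts and the outcomes."""
    rng = random.Random(seed)
    forecasts = {"perfect": [], "climatological": [],
                 "unfocused": [], "sign_reversed": []}
    outcomes = []
    for _ in range(n):
        mu = rng.gauss(0.0, 1.0)
        tau = rng.choice([-2.0, 2.0])
        outcomes.append(rng.gauss(mu, 1.0))
        forecasts["perfect"].append(mu)
        forecasts["climatological"].append(0.0)
        forecasts["unfocused"].append(mu + tau / 2.0)
        forecasts["sign_reversed"].append(-mu)
    return forecasts, outcomes


def murphy_curve(xs, ys, thetas):
    """Average elementary score at each theta: a Monte Carlo Murphy curve."""
    n = len(ys)
    return [sum(s_mean(t, x, y) for x, y in zip(xs, ys)) / n for t in thetas]


thetas = [-4.0 + 0.4 * k for k in range(21)]
forecasts, outcomes = simulate(n=5000)
curves = {name: murphy_curve(xs, outcomes, thetas)
          for name, xs in forecasts.items()}
```

Plotting each entry of `curves` against `thetas` gives a simulated counterpart of the mean panel in Figure 2. Because the curves are sample averages, small violations of the theoretical orderings can occur in the tails, where all expected scores are close to zero.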

By default, our Murphy diagrams show the expected elementary scores. If interest focuses on binary comparisons, it is natural to consider Murphy diagrams for the difference,

$$\theta \mapsto D(\theta) = \mathbb{E}_{\mathbb{Q}}\, S_\theta(X_1, Y) - \mathbb{E}_{\mathbb{Q}}\, S_\theta(X_2, Y), \tag{26}$$

between the expected elementary scores of two point forecasters.


3.4 Murphy diagrams for empirical forecasters 


We now turn to the comparison and ranking of empirical forecasts. Specifically, we consider tuples

$$(x_{i1}, \ldots, x_{i\ell}, y_i), \qquad i = 1, \ldots, n, \tag{27}$$

where $x_{1j}, \ldots, x_{nj}$ are the $j$th forecaster's point predictions, for $j = 1, \ldots, \ell$, and $y_1, \ldots, y_n$ are the respective outcomes. Thus, we have $\ell$ competing forecasters, and each of them issues a set of $n$ point predictions. A convenient interpretation of the empirical setting is as a special case of a point prediction space, in which the tuples $(X_1, \ldots, X_\ell, Y)$ in (20) attain each of the values in (27) with probability $1/n$. Then the probability measure $\mathbb{Q}$ is the corresponding empirical measure, and with this identification, the (average) empirical scores

$$s_j(\theta) = \frac{1}{n} \sum_{i=1}^{n} S_\theta(x_{ij}, y_i),$$

where $S_\theta$ is either $S^{Q}_{\alpha,\theta}$, $S^{E}_{\alpha,\theta}$, or $S^{B}_{\theta}$, become the expected elementary scores from (23), (24), and (25), respectively. To compare forecasters $X_1$ and $X_2$, say, it is convenient to show a Murphy plot of the equivalent of the difference (26), namely

$$\theta \mapsto D_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} d_i(\theta),$$

where

$$d_i(\theta) = S_\theta(x_{i1}, y_i) - S_\theta(x_{i2}, y_i) \tag{28}$$

for $i = 1, \ldots, n$, and again $S_\theta$ is either $S^{Q}_{\alpha,\theta}$, $S^{E}_{\alpha,\theta}$, or $S^{B}_{\theta}$, respectively.

Figure 3: The general shape of the score differential $d_i(\theta)$ in (28) for the median and mean functionals.

Murphy diagrams can be used efficiently to show a lack of domination when forecasters' expected elementary score curves intersect. However, in general it is not possible to conclude domination, unless the visual impression is supported by tedious analytic investigations of the behavior of the expected score functions as $\theta \to \pm\infty$. Fortunately, these complications do not arise in the empirical case, where dominance can be established by comparing the empirical score functions at a well-defined, finite set of arguments only, as follows.

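Corollary 2a (quantiles). An empirical forecast $X_1$ dominates $X_2$ for $\alpha$-quantile predictions if

$$\frac{1}{n} \sum_{i=1}^{n} S^{Q}_{\alpha,\theta}(x_{i1}, y_i) \le \frac{1}{n} \sum_{i=1}^{n} S^{Q}_{\alpha,\theta}(x_{i2}, y_i)$$

for $\theta \in \{x_{11}, x_{12}, y_1, \ldots, x_{n1}, x_{n2}, y_n\}$.

Corollary 2b (expectiles). An empirical forecast $X_1$ dominates $X_2$ for $\alpha$-expectile predictions if

$$\frac{1}{n} \sum_{i=1}^{n} S^{E}_{\alpha,\theta}(x_{i1}, y_i) \le \frac{1}{n} \sum_{i=1}^{n} S^{E}_{\alpha,\theta}(x_{i2}, y_i)$$

for $\theta \in \{x_{11}, x_{12}, y_1, \ldots, x_{n1}, x_{n2}, y_n\}$ and in the left-hand limit as $\theta \uparrow \theta_0 \in \{x_{11}, x_{12}, \ldots, x_{n1}, x_{n2}\}$. In the case $\alpha = 1/2$ evaluations at $\theta \in \{y_1, \ldots, y_n\}$ can be omitted.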

To see why these results hold, note that in either case the score differential $d_i(\theta)$ is right-continuous, and that it vanishes unless $\min(x_{i1}, x_{i2}) \le \theta < \max(x_{i1}, x_{i2})$. Furthermore, in the case of quantiles $d_i(\theta)$ is piecewise constant with no other jump points than $x_{i1}$, $x_{i2}$, or $y_i$. Similarly, in the case of expectiles $d_i(\theta)$ is piecewise linear with no other jump points than $x_{i1}$ and $x_{i2}$, and no other change of slope than at $y_i$. The change of slope disappears when $\alpha = 1/2$. Figure 3 illustrates the behavior of $d_i(\theta)$ in the cases of the median and the mean, respectively.

Figure 4: Left: Murphy diagram for the probability forecasters in Table A.1 of Merkle and Steyvers (2013). Right: The best forecast ID(s) under $S^{B}_{\theta}$, with dark blue indicating a unique best score, and light blue a shared best score. For example, ID 9 attains the unique best score for $\theta \in [0.02, 0.04)$, and ID 10 attains the shared best score for $\theta \in [0.91, 1)$. The horizontal axis in both panels is the parameter $\theta$.
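The finite checks in Corollaries 2a and 2b translate directly into code. The sketch below is illustrative: the function names are hypothetical, the elementary scores are assumed to have the forms used in Section 2 (as we read them), and the left-hand limit of the expectile score is obtained by replacing $\mathbb{1}(\theta < x)$ with $\mathbb{1}(\theta \le x)$, since the remaining terms are continuous in $\theta$.

```python
def s_quantile(theta, x, y, alpha):
    """Elementary quantile score (form assumed from Section 2)."""
    return ((y < x) - alpha) * ((theta < x) - (theta < y))


def s_expectile(theta, x, y, alpha, left_limit=False):
    """Elementary expectile score; left_limit=True gives the limit from the left in theta."""
    below_x = (theta <= x) if left_limit else (theta < x)
    weight = abs((y < x) - alpha)
    return weight * (max(y - theta, 0.0) - max(x - theta, 0.0)
                     - (y - x) * below_x)


def mean_diff(score, x1, x2, y, theta, **kw):
    """Empirical score difference D_n(theta) between forecasts x1 and x2."""
    n = len(y)
    return sum(score(theta, a, c, **kw) - score(theta, b, c, **kw)
               for a, b, c in zip(x1, x2, y)) / n


def dominates_quantile(x1, x2, y, alpha, tol=1e-12):
    """Corollary 2a: check D_n(theta) <= 0 at all forecast and outcome values."""
    candidates = set(x1) | set(x2) | set(y)
    return all(mean_diff(s_quantile, x1, x2, y, t, alpha=alpha) <= tol
               for t in candidates)


def dominates_expectile(x1, x2, y, alpha, tol=1e-12):
    """Corollary 2b: additionally check left-hand limits at the forecast values."""
    forecast_values = set(x1) | set(x2)
    at_points = all(
        mean_diff(s_expectile, x1, x2, y, t, alpha=alpha) <= tol
        for t in forecast_values | set(y))
    at_left_limits = all(
        mean_diff(s_expectile, x1, x2, y, t, alpha=alpha, left_limit=True) <= tol
        for t in forecast_values)
    return at_points and at_left_limits
```

As a usage example with made-up data, take outcomes `y = [1.0, 2.0, 3.0]`, a forecast `x1` equal to the outcomes, and a forecast `x2 = [2.0, 3.0, 4.0]` that overshoots by one unit. Then `dominates_quantile(x1, x2, y, 0.5)` and `dominates_expectile(x1, x2, y, 0.5)` hold, while the reverse checks fail. In the expectile case the reverse check fails only through the left-hand limits, because the score of `x2` vanishes at every point of the candidate set itself, which shows why Corollary 2b includes them.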

To give an example, we consider the 10 forecasters in Table A.1 of Merkle and Steyvers (2013), each of whom issues probability forecasts for 21 binary events. The data are artificial but mimic forecasters in the Aggregate Contingent Estimation System (ACES), a web based survey that solicited probability forecasts for world events from the general public. The Murphy diagram in the left-hand panel of Figure 4 shows the empirical score curves

$$\theta \mapsto s_j(\theta) = \frac{1}{n} \sum_{i=1}^{n} S^{B}_{\theta}(p_{ij}, y_i),$$

where $p_{ij} \in [0, 1]$ is forecaster $j$'s stated probability for world event $i$ to materialize, and $y_i \in \{0, 1\}$ is the respective binary realization. By Corollary 2b, dominance relations can be inferred by evaluating $s_j(\theta)$ at the forecasters' stated probabilities. We note that ID 3 dominates IDs 6 and 8, and that ID 5 dominates ID 10. The remaining pairwise comparisons do not give rise to dominance relations. The induced partial order between the IDs applies to comparisons under any proper scoring rule, as reflected by the rankings in Table 1 of Merkle and Steyvers (2013). The right-hand panel in Figure 4 considers joint comparisons. We see that ID 3 attains the lowest score over a wide range of $\theta$. However, IDs 2, 5, 7, and 9 show the unique best empirical score under $S^{B}_{\theta}$ for other values of $\theta$ and, therefore, have superior economic utility under the associated cost-loss ratios.
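A joint comparison of the kind shown in the right-hand panel of Figure 4 can be sketched as follows. The data below are hypothetical toy values, not the Merkle and Steyvers (2013) table, and the function names are illustrative; the elementary score $S^{B}_{\theta}$ is assumed to take the value $\theta$ if $y = 0$ and $p > \theta$, $1 - \theta$ if $y = 1$ and $p \le \theta$, and $0$ otherwise.

```python
def s_binary(theta, p, y):
    """Elementary score for a probability forecast p of a binary outcome y."""
    if y == 0 and p > theta:
        return theta
    if y == 1 and p <= theta:
        return 1.0 - theta
    return 0.0


def murphy_curve(probs, outcomes, thetas):
    """Empirical score curve s_j(theta) for one forecaster."""
    n = len(outcomes)
    return [sum(s_binary(t, p, y) for p, y in zip(probs, outcomes)) / n
            for t in thetas]


def best_forecasters(forecasts, outcomes, thetas, tol=1e-12):
    """For each theta, list the forecaster IDs attaining the lowest average score."""
    curves = {name: murphy_curve(p, outcomes, thetas)
              for name, p in forecasts.items()}
    best = []
    for k in range(len(thetas)):
        low = min(c[k] for c in curves.values())
        best.append(sorted(name for name, c in curves.items()
                           if c[k] <= low + tol))
    return best


# Hypothetical toy data: two forecasters, four binary events.
outcomes = [1, 0, 1, 0]
forecasts = {
    "A": [0.9, 0.1, 0.8, 0.2],
    "B": [0.6, 0.4, 0.5, 0.5],
}
thetas = [0.05, 0.3, 0.5, 0.7, 0.95]
best = best_forecasters(forecasts, outcomes, thetas)
```

In this toy example the sharper forecaster A attains the unique best score at intermediate cost-loss ratios, whereas both forecasters share the best score at the extreme values of $\theta$ considered.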

4 Empirical examples 

We now demonstrate the use of Murphy diagrams in economic and meteorological case studies in time series settings. In each example, interest is in a comparison of two forecasts, and so we show Murphy diagrams for the empirical scores and their difference. The jagged visual appearance stems from the behavior of the empirical score functions just explained and depends on the number $n$ of forecast cases. We supplement the Murphy diagrams for a difference by confidence bands based on Diebold and Mariano (1995) tests with a heteroscedasticity and autocorrelation robust variance estimator (Newey and West 1987). The approach of Diebold and Mariano (1995) views empirical data of the form (27) as a sample from an underlying population and tests the hypothesis of equal expected scores. The confidence bands are pointwise and have a nominal level of 95%.
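A pointwise band of this type can be sketched as follows for a single value of $\theta$, given the series of score differentials $d_1(\theta), \ldots, d_n(\theta)$. This is a minimal sketch rather than the implementation used for the figures: the Bartlett-kernel form of the Newey and West (1987) estimator is standard, but the truncation lag `lags` is a user choice that the text does not specify, and the function names are hypothetical.

```python
from statistics import NormalDist


def newey_west_variance(d, lags):
    """HAC estimate of Var(mean(d)) using Bartlett weights (Newey and West 1987)."""
    n = len(d)
    mean = sum(d) / n
    dev = [v - mean for v in d]

    def autocov(k):
        # Sample autocovariance at lag k, normalized by n.
        return sum(dev[t] * dev[t - k] for t in range(k, n)) / n

    long_run = autocov(0)
    for k in range(1, lags + 1):
        long_run += 2.0 * (1.0 - k / (lags + 1.0)) * autocov(k)
    return max(long_run, 0.0) / n


def pointwise_band(d, lags, level=0.95):
    """Pointwise confidence interval for the mean score difference at one theta."""
    n = len(d)
    mean = sum(d) / n
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * newey_west_variance(d, lags) ** 0.5
    return mean - half, mean + half
```

Applying `pointwise_band` to the differentials at each $\theta$ on a grid yields a band around the empirical difference curve $D_n(\theta)$. A band that excludes zero at a given $\theta$ corresponds to a rejection of equal expected elementary scores at that $\theta$ only, not to a simultaneous statement over all $\theta$.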

4.1 Mean forecasts of inflation 

In macroeconomics, subjective expert forecasts often compare favorably to statistical forecasting approaches; see Faust and Wright (2013) for evidence and discussion. For the United States, the Survey of Professional Forecasters (SPF) run by the Federal Reserve Bank of Philadelphia is a key data source; see, e.g., Engelberg et al. (2009). Patton (2015) uses SPF data to illustrate the use of various scoring functions that are consistent for the mean functional.

Motivated by Patton's analysis, we analyze SPF mean forecasts for the annual inflation rate of the Consumer Price Index (CPI). We compare the SPF forecasts to forecasts from another survey, the Michigan Survey of Consumers, based on data from the third quarter of 1982 to the third quarter of 2014, for a test period of 129 quarters. Our implementation choices are as in Section 5 of Patton (2015), except that we update the data set to cover the observations for the second and third quarters in 2014, and that we use the slightly newer fourth quarter of 2014 vintage for the CPI realizations. The top panel of Figure 5 shows the forecasts along with the realizing values.

The respective Murphy diagrams are shown in the top panel of Figure 6. At left, the curves for the empirical elementary score $S^{E}_{1/2,\theta}$ of the SPF and the Michigan survey intersect prominently, suggesting that neither of the two surveys dominates the other. Specifically, the SPF is preferred for smaller values, whereas the Michigan survey is preferred for larger values of $\theta$. This may be explained by a series of high inflation rates up until 1992, which were better matched by the Michigan survey than by the SPF. At right, the confidence bands for the score differences are fairly broad and include zero for all values of $\theta$.

4.2 Probability forecasts of recession 

We now relate to the rich literature on binary regression and prediction and analyze probability forecasts of United States recessions, as proxied by negative real gross domestic product (GDP) growth. The SPF covers probability forecasts for this event since the fourth quarter of 1968. Following Rudebusch and Williams (2009), we compare current quarter probability forecasts from the SPF to forecasts from a probit model based on the term spread, i.e., the difference between long and short term interest rates. We follow Rudebusch and Williams (2009) in all data and implementation choices, except that we update their sample through the second quarter of 2014, for a test period of 186 quarters. Detailed economic and/or statistical justification of these choices can be found in the original paper.

Figure 5: Point forecasts and realizations in the empirical examples (panel titles: Mean Inflation, Patton (2015), n = 129; Probability of Recession, Rudebusch and Williams (2009), n = 186; 90% Quantile of Wind Speed, Gneiting et al. (2006), n = 5136). In the middle plot, shaded areas correspond to actual recessions. The plot at bottom is restricted to a subperiod in summer 2003.

Figure 6: Murphy diagrams in the empirical examples. In the right column, a negative score difference means that the first named method is preferable.

The middle row of Figure 5 shows the SPF and probit model based probability forecasts for a recession, with the gray vertical bars indicating actual recessions. During recessionary periods, the SPF tends to assign higher forecast probabilities than the probit model. Also, the SPF tends to assign lower forecast probabilities during non-recessionary periods. The respective Murphy diagrams in the middle row of Figure 6 show that the SPF attains lower empirical elementary scores $S^{B}_{\theta}$ at all thresholds $\theta \in (0, 1)$. The confidence bands for the score differences exclude zero for small values of the cost-loss ratio $\theta$ and confirm the superiority of the SPF over the probit model for current quarter forecasts. This can partly be attributed to the fact that SPF panelists have access to timely within-quarter information that is not available to the probit model. As demonstrated by Rudebusch and Williams (2009), the relative performance of the probit model improves at longer forecast horizons, where within-quarter information plays a lesser role.

4.3 Quantile forecasts for wind speed 

We return to the meteorological example in Figure 1, but instead of the mean or expectation functional we now consider quantile forecasts at level $\alpha = 0.90$. We compare the regime-switching space-time (RST) approach introduced by Gneiting et al. (2006) to a simple autoregressive (AR) benchmark for two-hour ahead forecasts of hourly average wind speed at the Stateline wind energy center in the Pacific Northwest of the United States. The original paper refers to the specifications considered here as RST-D-CH and AR-D-CH, respectively. This terminology indicates that the methods account for the diurnal cycle and conditional heteroscedasticity. The data set, evaluation period, estimation and forecast methods for this example are identical to those in Gneiting et al. (2006), and we refer to the original paper for detailed descriptions. Both methods yield predictive distributions, from which we extract the quantile forecasts. The evaluation period ranges from May 1 through November 30, 2003, for a total of 5,136 hourly forecast cases.

The bottom panel in Figure 5 shows the quantile forecasts and realizations. The quantile forecasts exceed the outcomes at about the nominal level, at 89.7% for the RST forecast and 90.9% for the AR forecast, respectively, indicating good calibration. However, the RST forecasts are sharper, in that the average forecast value over the evaluation period is 9.2 meters per second, as compared to 9.7 meters per second in the case of the AR forecast. To see why the sharpness interpretation applies here, note that wind speed is a nonnegative quantity, so quantile forecasts can be identified with one-sided prediction intervals with a left limit of zero. These observations suggest the superiority of the RST forecasts over the benchmark AR forecasts, and the Murphy diagrams for the empirical elementary scores $S^{Q}_{0.90,\theta}$ in the bottom row of Figure 6 confirm this intuition, in line with what we saw in Figure 1 for the mean functional.


5 Discussion 

We have studied mixture representations of Choquet type for the scoring functions that are consistent for quantiles and expectiles, respectively, including the ubiquitous case of the mean or expectation functional, and nesting probability forecasts for binary events as a further special case. A particularly interesting aspect of these results is that they allow for an economic interpretation of consistent scoring functions in terms of betting and investment problems. Our interpretation of expectiles as optimal decision thresholds in investment problems with fixed costs and differential tax rates appears to be original and may bear on the current debate about the revision of the Basel protocol for banking regulation.

From a general applied perspective, Gneiting (2011, p. 757) had argued that if point forecasts are to be issued and evaluated,

“it is essential that either the scoring function be specified ex ante, or an elicitable target function be named, such as the mean or a quantile of the predictive distribution, and scoring functions be used that are consistent for the target functional.”

Patton (2015, p. 1) took this argument a step further, by positing that

“rather than merely specifying the target functional, which narrows the set of relevant loss functions only to the class of loss functions consistent for that functional [...] forecast consumers or survey designers should specify the single specific loss function that will be used to evaluate forecasts.”

This is a very valid point. Whenever forecasters are to be compensated for their efforts in one way or another, the scoring function ought to be disclosed. To give an example of this best practice, the participants of forecast competitions hosted on the Kaggle platform (www.kaggle.com) are routinely informed about the relevant scoring function prior to the start of the competition. See, e.g., Hong et al. (2014) for a description of the Global Energy Forecasting Competition 2012.

However, there remain many situations in which point forecasters receive directives in the form of a functional, without an accompanying scoring function being available. This might be because the forecasts are utilized by a myriad of communities, a situation often faced by national and international weather centers; because costs and losses are unknown or confidential; because the goal is general methodological development, as opposed to a specific applied task; because interest centers on an understanding of forecasters' behaviors and performance; or simply because of negligence of best practices. In such settings, our findings suggest the routine use of new diagnostic tools in the evaluation and ranking of forecasts, which we call Murphy diagrams. Interest sometimes centers on decompositions of expected or empirical scores into uncertainty, resolution, and reliability components, as studied by DeGroot and Fienberg (1983), Bröcker (2009), and Bentzien and Friederichs (2014), among others. Extensions of Murphy diagrams in these directions may be worthwhile.

Our results also bear on estimation problems, in that scoring functions connect naturally to M-estimation (Huber 1964; Koltchinskii 1997). An interesting observation is that the loss functions that have traditionally been employed for estimation in quantile regression, ordinary least squares regression, and expectile regression, namely the asymmetric piecewise linear scoring function and the asymmetric squared error scoring function (8), correspond to the choice of the Lebesgue measure in the mixture representation (9) and in its expectile counterpart from Theorem 1b, respectively. This is in contrast to binary regression, where estimation is typically based on the logarithmic score, which corresponds to the choice of the infinite measure with density $h(\theta) = (\theta(1 - \theta))^{-1}$ in the mixture representation (13), rather than the Lebesgue or uniform measure that yields (half) the Brier score (15). Quite generally, this raises the question of the optimal choice of the loss or scoring function to be used for estimation in regression problems. Focusing on the binary case, Hand and Vinciotti (2003), Buja et al. (2005), Lieli and Springborn (2013), and Elliott et al. (2015) have considered the use of economically motivated criteria. The interpretations developed in the present paper can help design economically motivated criteria in more general settings.

Mixture representations of Choquet type can be found for other, more general classes of consistent scoring functions. For instance, our results extend to the class of functionals known as generalized quantiles or M-quantiles (Breckling and Chambers 1988; Koltchinskii 1997; Bellini et al. 2014; Steinwart et al. 2014), which subsume both quantiles and expectiles. Related, but more complex mixture representations apply in the case of scoring functions that are consistent for multi-dimensional functionals, as recently studied by Fissler and Ziegel (2015).

An interesting question is whether there might be mixture representations in 
terms of economically interpretable elementary scores for proper scoring rules. As 
noted, a scoring rule $S(F, y)$ assigns a loss or penalty when we issue the predictive 
CDF $F$ and $y$ realizes, and for a scoring rule to be proper, the expectation inequalities 
in (22) need to hold. As we have seen, a predictive distribution for a binary 
variable can be identified with a probability forecast, so the representation (13) 
applies and the answer is well known to be positive in this case. However, an extension 
from probability forecasts of binary to ternary or general discrete variables 
does not appear to be feasible, due to results by Johansen (1974) and Bronshtein 
(1978) in convex analysis.* Despite this negative result, a closer look at a popular 
score is encouraging. Specifically, the widely used continuous ranked probability 
score (CRPS; Matheson and Winkler 1976), 

$$S(F, y) = \int_{-\infty}^{+\infty} \left( F(\theta) - \mathbb{1}(\theta \ge y) \right)^2 \, d\theta,$$

equals the integral of the Brier score (15) for the induced probability forecast, 
namely $F(\theta)$, of the binary event $\{Y \le \theta\}$ over all thresholds $\theta \in \mathbb{R}$. For simplicity, 
let us assume that $F$ has unique quantiles. We may then invoke the mixture 
representation (13) along with the relationships (14) and (16) to yield 

$$S(F, y) = 2 \int_{-\infty}^{+\infty} \int_0^1 S^B_{\alpha}\big(F(\theta), \mathbb{1}(\theta \ge y)\big) \, d\alpha \, d\theta = 2 \int_{-\infty}^{+\infty} \int_0^1 S^Q_{\alpha,\theta}(q_{\alpha,F}, y) \, d\alpha \, d\theta.$$

Depending on the order of integration, the mixture representation recovers the 
quantile or the threshold decomposition of the CRPS (Gneiting and Ranjan 2011) 
after evaluating the first integral. More complex weighting schemes depending on 
$\theta$ and $\alpha$ can be employed, for a general family of proper scoring rules that can be 
economically motivated and justified. Related ideas have recently been put forward 
in the hydrologic and meteorological literatures (Laio and Tamea 2007; Bradley and 
Schwartz 2011; Smet et al. 2012). 

* In a nutshell, Savage (1971) showed that in the case of $k + 1$ categories, the proper scoring 
rules for probability forecasts essentially are parameterized by the convex functions on the unit 
simplex in $\mathbb{R}^k$. Johansen (1974) and Bronshtein (1978) proved that if $k \ge 2$ then the extremal 
members of that class lie dense. 

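
The two decompositions of the CRPS can be compared numerically. The following sketch does so for a standard normal predictive distribution, a choice made purely for illustration. The closed form used as a reference is the well-known expression for the CRPS of a normal forecast. The integration ranges, grid sizes, and function names are assumptions of this sketch.

```python
import math
from statistics import NormalDist

F = NormalDist(0.0, 1.0)  # illustrative predictive distribution

def crps_closed_form(y):
    """Closed-form CRPS of a standard normal forecast at outcome y."""
    return y * (2.0 * F.cdf(y) - 1.0) + 2.0 * F.pdf(y) - 1.0 / math.sqrt(math.pi)

def crps_threshold(y, lo=-8.0, hi=8.0, n=100_000):
    """Integrate the Brier score of the event {Y <= theta} over thresholds theta."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * h
        indicator = 1.0 if theta >= y else 0.0
        total += (F.cdf(theta) - indicator) ** 2
    return total * h

def crps_quantile(y, n=100_000):
    """Integrate twice the asymmetric piecewise linear score over levels alpha."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        alpha = (i + 0.5) * h
        q = F.inv_cdf(alpha)
        indicator = 1.0 if y < q else 0.0
        total += 2.0 * (indicator - alpha) * (q - y)
    return total * h

y = 0.3
# The three values should agree up to discretization and truncation error.
print(crps_closed_form(y), crps_threshold(y), crps_quantile(y))
```

Replacing the constant weight in either loop by a function of the threshold or of the level gives a weighted variant of the kind mentioned above.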
Appendix A: Proofs 

The specific structure of the scoring functions in (5) and (7) permits us to focus on 
the case $\alpha = 1/2$ in the subsequent proofs, with the general case $\alpha \in (0,1)$ then 
being immediate. 


A1 Proof of Theorems 1a and 1b 


In the case of quantiles, the mixture representation (9), the fact that $dH(\theta) = dg(\theta)$ 
for $\theta \in \mathbb{R}$, and the relationship $H(x) - H(y) = S(x, y)/(1 - \alpha)$ for $x > y$, are 
straightforward consequences of the fact that for every $g \in \mathcal{I}$ and $x, y \in \mathbb{R}$, 

$$g(x) - g(y) = \int_{-\infty}^{+\infty} \left\{ \mathbb{1}(\theta < x) - \mathbb{1}(\theta < y) \right\} \, dg(\theta).$$

As the increments of $H$ are determined by $S$, the mixing measure is unique. 

Turning now to the case of expectiles, we associate with any function $\phi \in \mathcal{C}$ the 
Bregman type function of two variables 

$$\Phi(x, y) = \phi(y) - \phi(x) - \phi'(x)(y - x) \qquad (x, y \in \mathbb{R}). \tag{29}$$

Then the mixture representation (11), the fact that $dH(\theta) = d\phi'(\theta)$ for $\theta \in \mathbb{R}$, 
and the relationship $H(x) - H(y) = \partial_2 S(x, y)/(1 - \alpha)$ for $x > y$, are immediate 
consequences of the fact that for all $\phi \in \mathcal{C}$ and $x < y$, 

$$\Phi(x, y) = (y - \theta) \, \phi'(\theta) \Big|_{\theta = x}^{y} + \int_x^y \phi'(\theta) \, d\theta = \int_x^y (y - \theta) \, d\phi'(\theta) = 2 \int_{-\infty}^{+\infty} S^E_{1/2,\theta}(x, y) \, d\phi'(\theta).$$

The case $x > y$ is handled analogously, and the case $x = y$ is trivial. Finally, as the 
increments of $H$ are determined by $S$, the mixing measure is unique. 
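
As a numerical illustration of the two mixture representations, the sketch below integrates the elementary scores against the Lebesgue measure. It assumes that the elementary scores in (10) and (12) take the forms written in the docstrings, which agree with the score differences used in Appendix A3. The function names, integration range, and grid size are illustrative assumptions.

```python
def elementary_quantile_score(alpha, theta, x, y):
    """Assumed form: (1(y < x) - alpha) * (1(theta < x) - 1(theta < y))."""
    return ((y < x) - alpha) * ((theta < x) - (theta < y))

def elementary_expectile_score(alpha, theta, x, y):
    """Assumed form: |1(y < x) - alpha| * ((y - theta)_+ - (x - theta)_+ - (y - x) 1(theta < x))."""
    pos = lambda t: max(t, 0.0)
    core = pos(y - theta) - pos(x - theta) - (y - x) * (theta < x)
    return abs((y < x) - alpha) * core

def lebesgue_mixture(score, alpha, x, y, lo=-10.0, hi=10.0, n=200_000):
    """Midpoint-rule integral of an elementary score over theta in [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(score(alpha, lo + (i + 0.5) * h, x, y) for i in range(n))

alpha, x, y = 0.8, 1.5, -0.7
piecewise_linear = ((y < x) - alpha) * (x - y)
half_squared = 0.5 * abs((y < x) - alpha) * (x - y) ** 2
# Each printed pair should agree up to discretization error.
print(lebesgue_mixture(elementary_quantile_score, alpha, x, y), piecewise_linear)
print(lebesgue_mixture(elementary_expectile_score, alpha, x, y), half_squared)
```

With these forms the Lebesgue mixture of the expectile scores yields one half of the asymmetric squared error, since the mixing measure for $\phi(t) = t^2$ is $d\phi'(\theta) = 2 \, d\theta$.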


A2 Proof of Propositions 1a and 1b 


In the case of the elementary quantile scoring function (10), suppose that $S^Q_{\alpha,\theta} = 
(S_1 + S_2)/2$, where $S_1$ and $S_2$ are of the form (5) with associated functions $g_1, g_2 \in \mathcal{I}_1$. 
Then 

$$0 = \begin{cases} (g_1(x) - g_1(y)) + (g_2(x) - g_2(y)) - 2, & y \le \theta < x, \\ (g_1(x) - g_1(y)) + (g_2(x) - g_2(y)) + 2, & x \le \theta < y, \\ (g_1(x) - g_1(y)) + (g_2(x) - g_2(y)), & \text{otherwise}. \end{cases}$$

As $g_1, g_2 \in \mathcal{I}_1$ we have $g_j(x) - g_j(y) \in [0, 1]$ if $y < x$, and $g_j(x) - g_j(y) \in [-1, 0]$ 
if $x < y$, where $j = 1, 2$. It follows that $g_1(x) - g_1(y) = g_2(x) - g_2(y) = 1$ 
in the first case, $g_1(x) - g_1(y) = g_2(x) - g_2(y) = -1$ in the second case, and 
$g_1(x) - g_1(y) = g_2(x) - g_2(y) = 0$ in the third case. This coincides with the value 
distribution of $g(x) - g(y)$ when $g(x) = \mathbb{1}(\theta < x)$, whence indeed $S_1 = S_2 = S^Q_{\alpha,\theta}$. 


In the case of the elementary expectile scoring function (12), suppose that $S^E_{\alpha,\theta} = 
(S_1 + S_2)/2$, where $S_1$ and $S_2$ are of the form (7) with associated functions $\phi_1, \phi_2 \in \mathcal{C}_1$. 
Let $\Phi_1, \Phi_2$ be defined as in (29). Then 

$$\Phi_1(x, y) + \Phi_2(x, y) - 4 \, S^E_{1/2,\theta}(x, y) = 0.$$

Taking left-hand derivatives with respect to $y$, we obtain 

$$(\phi_1'(x) - \phi_1'(y)) + (\phi_2'(x) - \phi_2'(y)) - 2 \, (\mathbb{1}(\theta < x) - \mathbb{1}(\theta < y)) = 0.$$

As $\phi_1', \phi_2' \in \mathcal{I}_1$, we may apply the same argument as in the quantile case to show 
that $\phi_1'(x) - \phi_1'(y) = \phi_2'(x) - \phi_2'(y) = \mathbb{1}(\theta < x) - \mathbb{1}(\theta < y)$, whence $S_1 = S_2 = S^E_{\alpha,\theta}$. 


A3 Proof of Propositions 2a and 2b 

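In the case of the elementary quantile scoring function $S^Q_{\alpha,\theta}$ in (10) suppose first 
that $x_2 < x_1 < q_{\alpha,F}$. Since 

$$S^Q_{\alpha,\theta}(x_2, y) - S^Q_{\alpha,\theta}(x_1, y) = (\mathbb{1}(y \le \theta) - \alpha) \, (\mathbb{1}(\theta < x_2) - \mathbb{1}(\theta < x_1)),$$

we have 

$$\mathbb{E}_F[S^Q_{\alpha,\theta}(x_2, Y)] - \mathbb{E}_F[S^Q_{\alpha,\theta}(x_1, Y)] = (F(\theta) - \alpha) \, (\mathbb{1}(\theta < x_2) - \mathbb{1}(\theta < x_1)).$$

The second factor on the right-hand side vanishes unless $\theta \in [x_2, x_1)$, and under 
this latter condition we have $F(\theta) < \alpha$ and $\mathbb{1}(\theta < x_2) - \mathbb{1}(\theta < x_1) = -1$, whence 
the desired expectation inequality. The case $q_{\alpha,F} < x_1 < x_2$ is handled analogously. 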
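
Forecast $F$ | $\alpha$-Quantile: $\mathbb{E}_Q S^Q_{\alpha,\theta}(q_{\alpha,F}, Y)$ | Mean: $\mathbb{E}_Q S^E_{1/2,\theta}(\mu_F, Y)$ 
Perfect | $a_{\alpha,\theta} + \alpha \Phi(\theta - z_\alpha) + \int_{\theta - z_\alpha}^{\infty} \Lambda_\theta(x) \, dx$ | $c_\theta - \theta \Phi(\theta) - \varphi(\theta)$ 
Climatological | $a_{\alpha,\theta} + \min(\Phi(\theta/\sqrt{2}), \alpha)$ | $c_\theta - \theta \, \mathbb{1}(\theta > 0)$ 
Unfocused | $a_{\alpha,\theta} + \mathbb{E}_\tau \left[ \alpha \Phi(\theta - z_{\alpha,\tau}) + \int_{\theta - z_{\alpha,\tau}}^{\infty} \Lambda_\theta(x) \, dx \right]$ | $c_\theta - \mathbb{E}_\tau \left[ \theta \Phi(\theta - \tau/2) + \varphi(\theta - \tau/2) \right]$ 
Sign-reversed | $a_{\alpha,\theta} + \alpha \Phi(\theta - z_\alpha) + \int_{-\infty}^{z_\alpha - \theta} \Lambda_\theta(x) \, dx$ | $c_\theta - \theta \Phi(\theta) + \varphi(\theta)$ 

Table 3: Expected extremal scores in the prediction space example. For 
$\alpha \in (0,1)$ and $\theta \in \mathbb{R}$, we let $a_{\alpha,\theta} = -\alpha \Phi(\theta/\sqrt{2})$, $\Lambda_\theta(x) = \Phi(\theta - x) \varphi(x)$, and 
$c_\theta = \theta \Phi(\theta/\sqrt{2}) + \sqrt{2} \, \varphi(\theta/\sqrt{2})$, where $\Phi$ and $\varphi$ denote the CDF and the probability density 
function of the standard normal distribution, respectively. 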


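In the case of the elementary expectile scoring function $S^E_{\alpha,\theta}$ in (12) we assume 
first that $x_2 < x_1 < t$, where $t$ denotes the $\alpha$-expectile of $F$. Since 

$$S^E_{\alpha,\theta}(x_2, y) - S^E_{\alpha,\theta}(x_1, y) = \left( (1 - \alpha)(\theta - y)_+ - \alpha (y - \theta)_+ \right) (\mathbb{1}(\theta < x_2) - \mathbb{1}(\theta < x_1)),$$

we get 

$$\mathbb{E}_F[S^E_{\alpha,\theta}(x_2, Y)] - \mathbb{E}_F[S^E_{\alpha,\theta}(x_1, Y)] = \left( (1 - \alpha) \, \mathbb{E}_F (\theta - Y)_+ - \alpha \, \mathbb{E}_F (Y - \theta)_+ \right) (\mathbb{1}(\theta < x_2) - \mathbb{1}(\theta < x_1)).$$

As the first term on the right-hand side is strictly increasing in $\theta$ and has a unique 
zero at the $\alpha$-expectile of $F$, the proof can be completed in the same way as above. 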
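
The order sensitivity established above can be illustrated with a small discrete distribution, for which expected scores are exact finite sums. In the sketch below the distribution, the forecast values, the threshold grid, and the function names are illustrative assumptions, and the elementary scores are assumed to take the forms given in the docstrings.

```python
# Discrete predictive distribution F, chosen purely for illustration.
atoms = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
probs = [0.1, 0.2, 0.3, 0.2, 0.1, 0.1]

def quantile_score(alpha, theta, x, y):
    """Assumed form of the elementary quantile score in (10)."""
    return ((y < x) - alpha) * ((theta < x) - (theta < y))

def expectile_score(alpha, theta, x, y):
    """Assumed form of the elementary expectile score in (12)."""
    pos = lambda t: max(t, 0.0)
    core = pos(y - theta) - pos(x - theta) - (y - x) * (theta < x)
    return abs((y < x) - alpha) * core

def expected_score(score, alpha, theta, x):
    """Exact expectation of the score when Y follows the discrete F."""
    return sum(p * score(alpha, theta, x, y) for p, y in zip(probs, atoms))

def never_worse(score, alpha, x_near, x_far, thetas):
    """True if x_near attains an expected score <= that of x_far for every theta."""
    return all(
        expected_score(score, alpha, t, x_near)
        <= expected_score(score, alpha, t, x_far) + 1e-12
        for t in thetas
    )

alpha = 0.5
thetas = [-3.0 + 0.05 * k for k in range(141)]  # grid on [-3, 4]
# The median of F is 0 (F jumps from 0.3 to 0.6 there); the mean of F is 0.3.
print(never_worse(quantile_score, alpha, -0.5, -1.5, thetas))
print(never_worse(expectile_score, alpha, 0.0, -1.0, thetas))
```

In each call the first forecast lies between the second forecast and the target functional, so it should never receive the larger expected elementary score, whatever the threshold.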


Appendix B: Details for the synthetic example 


Here we give details for the synthetic example introduced and discussed in the 
main text. Table 3 shows analytic expressions for the expected score 

$$\mathbb{E}_Q \, S(T(F), Y)$$

where $F$ is either the perfect, the climatological, the unfocused, or the sign-reversed 
forecaster, and the functional $T(F)$ is either the $\alpha$-quantile, $q_{\alpha,F}$, or the mean, $\mu_F$, 
of the CDF-valued random quantity $F$. The scoring function $S$ is the elementary 
quantile scoring function $S^Q_{\alpha,\theta}$ in (10) or the elementary scoring function $S^E_{1/2,\theta}$ in 
(12). For example, if $X$ is a quantile forecast for $Y$ at level $\alpha \in (0,1)$ then 

$$\mathbb{E}_Q \, S^Q_{\alpha,\theta}(X, Y) = -\alpha \, Q(Y \le \theta) + \alpha \, Q(X \le \theta) + Q(X > \theta, \, Y \le \theta) \tag{30}$$

decomposes into three terms, the first depending on the outcome only, the second 
depending on the forecast only, and the third accounting for the joint distribution. 


In view of the relationships (14) and (16), the foregoing covers the case of the 
extremal scoring function for event probabilities, too. 
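
The decomposition (30) can be checked by simulation. The sketch below treats the climatological forecaster and assumes, as a setup not restated in this appendix, that the outcome is $Y = \mu + \varepsilon$ with independent standard normal $\mu$ and $\varepsilon$, so that the climatological predictive distribution is normal with mean 0 and variance 2. The sample size, seed, and function names are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

std_normal = NormalDist()

def simulate_climatological(alpha, theta, n=100_000, seed=1):
    """Monte Carlo estimates of the expected elementary quantile score."""
    rng = random.Random(seed)
    # Quantile forecast: the alpha-quantile of N(0, 2), constant across cases.
    x = math.sqrt(2.0) * std_normal.inv_cdf(alpha)
    direct = 0.0
    n_y = n_x = n_joint = 0
    for _ in range(n):
        y = rng.gauss(0.0, 1.0) + rng.gauss(0.0, 1.0)  # assumed Y = mu + eps
        direct += ((y < x) - alpha) * ((theta < x) - (theta < y))
        n_y += y <= theta
        n_x += x <= theta
        n_joint += (x > theta) and (y <= theta)
    # Three-term decomposition: outcome only, forecast only, joint distribution.
    decomposition = (-alpha * n_y + alpha * n_x + n_joint) / n
    return direct / n, decomposition

def climatological_expected_score(alpha, theta):
    """Closed form a_{alpha,theta} + min(Phi(theta / sqrt(2)), alpha)."""
    phi = std_normal.cdf(theta / math.sqrt(2.0))
    return -alpha * phi + min(phi, alpha)

alpha, theta = 0.25, 0.5
direct, decomposition = simulate_climatological(alpha, theta)
# The first two values coincide; the third differs only by Monte Carlo error.
print(direct, decomposition, climatological_expected_score(alpha, theta))
```

The direct average and the three-term decomposition agree case by case, so they coincide up to rounding, whereas the closed form is matched only up to sampling error.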